Abstract

This work considers the problem of learning linear Bayesian networks when some of the vari- 
ables are unobserved. Identifiability and efficient recovery from low-order observable moments are 
established under a novel graphical constraint. The constraint concerns the expansion properties 
of the underlying directed acyclic graph (DAG) between observed and unobserved variables in 
the network, and it is satisfied by many natural families of DAGs that include multi-level DAGs, 
DAGs with effective depth one, as well as certain families of polytrees. 



1 Introduction 



It is widely recognized that incorporating latent or hidden variables is a crucial aspect of modeling. 
Latent variables can provide a succinct representation of the observed data through dimensionality 
reduction; the possibly many observed variables are summarized by fewer hidden effects. Further, 
they are central to predicting causal relationships and interpreting the hidden effects as unobservable 
concepts. For instance in sociology, human behavior is affected by abstract notions such as social 
attitudes, beliefs, goals and plans. As another example, medical knowledge is organized into causal
hierarchies of invading organisms, physical disorders, pathological states and symptoms, and only 
the symptoms are observed. 

In addition to incorporating latent variables, it is also important to model the complex dependencies
among the variables. A popular class of models for incorporating such dependencies is that of
Bayesian networks, also known as belief networks. They incorporate a set of causal and conditional
independence relationships through directed acyclic graphs (DAGs) [36]. They have widespread
applicability in artificial intelligence [13, 19, 31, 32], in the social sciences [9, 12, 30, 37, 38, 51], and as
structural equation models in economics [7, 12, 24, 38, 47, 52].

(a) Multi-level DAG (b) DAG with effective depth one

Figure 1: Illustrations of multi-level DAGs and DAGs with effective depth one. Observed nodes and
hidden nodes are respectively shown by shaded and white circles. Under the expansion property for
the graph and the linear dependence model (Section 2.2), we prove identifiability of these ensembles
from low-order moments of the observed variables.

An important statistical task is to learn such latent Bayesian networks from observed data. This
involves discovery of the hidden variables, structure estimation (of the DAG) and estimation of the
model parameters. Typically, in the presence of hidden variables, the learning task suffers from
identifiability issues since there may be many models which can explain the observed data. In order
to overcome indeterminacy issues, one must restrict the set of possible models. We establish novel
criteria for identifiability of latent DAG models using only low-order observed moments (second- and
third-order moments). We introduce a graphical constraint which we refer to as the expansion property.
Roughly speaking, the expansion property states that every subset of hidden nodes has "enough"
outgoing edges, so that they have a noticeable influence on the observed nodes, and thus on the samples
drawn from the joint distribution of the observed nodes. This notion implies new identifiability and
learning results for DAG structures. More specifically, we show that some
broad families of DAG models with hidden variables, including multi-level DAGs and DAGs with
effective depth one (which include a subset of trees and polytrees^1), satisfy this constraint and
are thus identifiable from only second and third observed moments. In addition, we propose novel
and efficient algorithms for the learning task which leverage ideas from sparse recovery and
dictionary learning [46] as well as from spectral methods for inverse moment problems [4].

2 Model and outline of the results 
2.1 Notation 

We write $\|v\|_p$ for the standard $\ell_p$ norm of a vector $v$. Specifically, $\|v\|_0$ denotes the number of
non-zero entries in $v$. Also, $\|M\|_p$ refers to the induced operator norm on a matrix $M$. For a matrix
$M$ and sets of indices $I, J$, we let $M_I$ denote the submatrix containing just the rows in $I$ and $M_{I,J}$
denote the submatrix formed by the rows in $I$ and columns in $J$. For a vector $v$, $\mathrm{supp}(v)$ represents
the positions of non-zero entries of $v$. We use $e_i$ to refer to the $i$-th standard basis element, e.g.,
$e_1 = (1, 0, \dots, 0)$. For a matrix $M$ we let $\mathrm{Row}(M)$ (similarly $\mathrm{Col}(M)$) denote the span of its rows
(columns). For a set $S$, $|S|$ is its cardinality. We use the notation $[n]$ to denote the set $\{1, \dots, n\}$.
For a vector $v$, $\mathrm{diag}(v)$ is a diagonal matrix with the elements of $v$ on the diagonal. For a matrix $M$,
$\mathrm{diag}(M)$ is a diagonal matrix with the same diagonal as $M$.

1 A polytree is a directed acyclic graph where, ignoring the directions, the graph is a tree.

2.2 Model 

We define a DAG model as a pair $(\mathcal{G}, \mathbb{P}_\theta)$, where $\mathbb{P}_\theta$ is a joint probability distribution, parameterized
by $\theta$, on $n$ variables $x := (x_1, \dots, x_n)$ that is Markov with respect to a DAG $\mathcal{G} = (V, \mathcal{E})$ with
$V = \{1, \dots, n\}$ [33]. More specifically, the joint probability $\mathbb{P}_\theta(x)$ factors as

$$\mathbb{P}_\theta(x) = \prod_{i=1}^{n} \mathbb{P}_\theta(x_i \mid x_{\mathrm{PA}_i}), \tag{1}$$

where $\mathrm{PA}_i := \{j \in V : (j, i) \in \mathcal{E}\}$ denotes the set of parents of node $i$ in $\mathcal{G}$.

The learning task involving DAG models can be described as: Given i.i.d. samples generated
from the joint distribution $\mathbb{P}_\theta$ over $x_S$ for some $S \subseteq V$, recover (some part of) the graph structure $\mathcal{G}$
and estimate the model parameter $\theta$.

We consider a DAG $\mathcal{G} = (V_{\mathrm{obs}} \cup V_{\mathrm{hid}}, \mathcal{E})$ with observed nodes $V_{\mathrm{obs}} = \{x_1, \dots, x_n\}$ and hidden nodes
$V_{\mathrm{hid}} = \{h_1, \dots, h_k\}$. Let $\varepsilon_i$ be the noise variable associated to $x_i$, for $i = 1, \dots, n$, and denote the
variance of $\varepsilon_i$ by $\sigma_{\varepsilon_i}^2 > 0$. Throughout we use the notation $h := (h_1, \dots, h_k)$, $x := (x_1, \dots, x_n)$ and
$\varepsilon := (\varepsilon_1, \dots, \varepsilon_n)$. The noise terms $\varepsilon$ are assumed to be uncorrelated. The class of models considered
is specified by the following assumptions.

Condition 1 (Linearity). The observed and hidden variables obey the model^2

$$x_i = \sum_{j \in \mathrm{PA}_i} a_{ij} h_j + \varepsilon_i, \quad \text{for } i \in [n]. \tag{2}$$

Furthermore, the hidden variables are linearly independent, i.e., if $\sum_{i \in [k]} \alpha_i h_i = 0$ with probability
one, then $\alpha_i = 0$ for all $i \in [k]$.

We note that without a non-degeneracy assumption on the hidden variables there is no hope of 
distinguishing different hidden nodes. 

Notice that the structure of $\mathcal{G}$ is defined by the non-zero coefficients in Eq. (2). Therefore, there
is no edge among the observed nodes. We define $A \in \mathbb{R}^{n \times k}$ by letting the $(i, j)$ entry be $a_{ij}$ if $j \in \mathrm{PA}_i$
and zero otherwise. We refer to matrix $A$ as the coefficient matrix.

Remark 2.1. The linear relationships described above can be thought of as linear structural equation
models (SEM). In general, an SEM is defined by a collection of equations

$$z_i = f_i(z_{\mathrm{PA}_i}, \varepsilon_i), \tag{3}$$

with $z_i$ being the variables associated to the nodes. Recently, there has been some progress on the
identifiability problem of SEMs in the fully observed setting [26, 39, 40, 44]. This paper can be viewed
as a contribution to the problem of identifiability and learning SEMs with latent variables.

2 Without loss of generality, assume that $x_i$, $\varepsilon_i$, $h_j$ are all zero mean.

We now describe sufficient conditions under which the linear DAG model with hidden variables
becomes identifiable. Given observations $x$, note that we can only hope to identify the columns of
matrix $A$ up to permutation because the model is unchanged if one permutes the hidden variables
$h$ and the columns of $A$ correspondingly. Moreover, the scale of each column of $A$ is also not
identifiable. To see this, observe that Eq. (2) is unaltered if we both rescale all the coefficients
$\{a_{ij}\}_{i \in [n]}$ in column $j$ and appropriately rescale the variable $h_j$. Without further assumptions, we can only hope
to recover a certain canonical form of $A$, defined as follows:

Definition 2.2. We say $A$ is in a canonical form if for each $j \in [k]$, $\sigma_{h_j}^2 = \mathbb{E}[h_j^2] = 1$. In particular,
the transformation $A \leftarrow A\,\mathrm{diag}(\sigma_{h_1}, \sigma_{h_2}, \dots, \sigma_{h_k})$ and the corresponding rescaling of $h$ place $A$ in
canonical form and the distribution over $x_i$, $i \in [n]$, is unchanged.

Furthermore, observe that the canonical $A$ is only specified up to the sign of each column, since any
sign change of column $i$ does not alter the variance of $h_i$.
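As an informal numerical illustration of Definition 2.2, the following Python sketch checks that rescaling the columns of $A$ by the standard deviations of the hidden variables leaves the low-rank part of the second-order moment unchanged, while the rescaled hidden variables have unit second moment. The matrix sizes, the random seed, and the particular matrix `cov_h` standing in for $\mathbb{E}[hh^\top]$ are hypothetical choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

# Hypothetical coefficient matrix and second moment E[h h^T] of the hidden variables.
A = rng.normal(size=(n, k))
cov_h = np.array([[4.0, 1.0],
                  [1.0, 9.0]])

# Standard deviations sigma_{h_j} of the (zero-mean) hidden variables.
sigma_h = np.sqrt(np.diag(cov_h))

# Canonical form: scale column j of A by sigma_{h_j}, and h_j by 1 / sigma_{h_j}.
A_canon = A * sigma_h
cov_h_canon = cov_h / np.outer(sigma_h, sigma_h)

# The contribution A E[h h^T] A^T to E[x x^T] is the same before and after rescaling.
low_rank = A @ cov_h @ A.T
low_rank_canon = A_canon @ cov_h_canon @ A_canon.T

print(np.allclose(low_rank, low_rank_canon))
print(np.diag(cov_h_canon))
```

Flipping the sign of a column of `A_canon` together with the sign of the corresponding hidden variable would pass the same check, which is the sign ambiguity noted above.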

We now discuss a rank condition on the coefficient matrix A. 

Condition 2 (Rank condition). There exists a fixed partition $\mathcal{P}$ of $[n]$ such that $|\mathcal{P}| = 3$, and $A_I$
has full column rank for all $I \in \mathcal{P}$.

Since $\mathrm{rank}(A_I) = k$ for $I \in \mathcal{P}$, we have as a consequence $n \ge |\mathcal{P}|\,k = 3k$. Therefore, the condition essentially
states that the number of hidden nodes should be at most one third of the number of observed ones. In most
applications, we are looking for a small number of hidden effects that can represent the statistical
dependence relationships among the observed nodes. Thus the rank condition is reasonable in these
cases. As we will see later, due to this assumption we can extract the noise term from the observed
moments.

We proceed by defining the expansion property of a graph which plays a key role in establishing 
our identifiability results. 

Definition 2.3. Let $\mathcal{H}(V_1, V_2)$ be a bipartite DAG with parts $V_1$ and $V_2$, and edges directed from $V_1$
to $V_2$. We say that $\mathcal{H}(V_1, V_2)$ satisfies the expansion property if for any subset $S \subseteq V_1$ with $|S| \ge 2$,
we have $|N(S)| \ge |S| + d_{\max}$, where $N(S) := \{i \in V_2 : (j, i) \in \mathcal{E} \text{ for some } j \in S\}$ is the set of the
neighbors of $S$ and $d_{\max}$ is the maximum degree of nodes in $V_1$.

Condition 3 (Graph expansion). Let $\mathcal{H}(V_{\mathrm{hid}}, V_{\mathrm{obs}})$ denote the graph formed by the edges between
$V_{\mathrm{hid}}$ and $V_{\mathrm{obs}}$. Then, $\mathcal{H}(V_{\mathrm{hid}}, V_{\mathrm{obs}})$ has the expansion property.
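For small graphs, the expansion property of Definition 2.3 can be checked by brute force. The Python sketch below is only illustrative: the function name and the representation of the bipartite graph (a list whose $j$-th entry is the set of observed neighbors of hidden node $j$) are assumptions made here, and the enumeration is exponential in the number of hidden nodes.

```python
from itertools import combinations


def has_expansion_property(children):
    """Check |N(S)| >= |S| + d_max for every subset S of hidden nodes with |S| >= 2.

    children[j] is the set of observed nodes that hidden node j points to.
    Assumes at least one hidden node.
    """
    k = len(children)
    d_max = max(len(c) for c in children)  # maximum degree among hidden nodes
    for size in range(2, k + 1):
        for subset in combinations(range(k), size):
            # N(S): union of the neighborhoods of the hidden nodes in S.
            neighbors = set().union(*(children[j] for j in subset))
            if len(neighbors) < size + d_max:
                return False
    return True


# Two hidden nodes with disjoint sets of three children: |N(S)| = 6 >= 2 + 3.
disjoint = [{0, 1, 2}, {3, 4, 5}]
# Two hidden nodes sharing two children: |N(S)| = 4 < 2 + 3.
overlapping = [{0, 1, 2}, {1, 2, 3}]

print(has_expansion_property(disjoint))     # True
print(has_expansion_property(overlapping))  # False
```

The second example shows how heavily overlapping neighborhoods violate the condition: the two hidden nodes jointly reach too few observed nodes to be told apart.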

The last condition is a generic assumption on the entries of matrix A. We first define the 
parameter genericity property for a matrix. 

Definition 2.4. We say that matrix $M \in \mathbb{R}^{n \times k}$ has the parameter genericity property if for any
$v \in \mathbb{R}^k$ with $\|v\|_0 \ge 2$, the following holds true.

$$\|Mv\|_0 > |N_M(\mathrm{supp}(v))| - |\mathrm{supp}(v)|, \tag{4}$$

where for a set $S \subseteq [k]$, $N_M(S) := \{i \in [n] : M_{ij} \ne 0 \text{ for some } j \in S\}$.

Condition 4 (Parameter genericity). The coefficient matrix A has the parameter genericity property. 

This is a mild generic condition. More specifically, if the entries of an arbitrary fixed matrix $M$
are perturbed independently, then it satisfies the above generic property with probability one.

Remark 2.5. Fix any matrix $M \in \mathbb{R}^{n \times k}$. Let $Z \in \mathbb{R}^{n \times k}$ be a random matrix such that $\{Z_{ij} : M_{ij} \ne 0\}$
are independent random variables, and $Z_{ij} = 0$ whenever $M_{ij} = 0$. Assume each variable is drawn
from a distribution with uncountable support. Then

$$\mathbb{P}(M + Z \text{ does not satisfy Condition 4}) = 0. \tag{5}$$

Remark 2.5 is proved in Appendix A.



2.3 Summary of contributions 

We establish identifiability of different classes of linear DAG models from the observed data, and also 
propose efficient algorithms for the learning task. In the following, we summarize our identifiability 
results and the proposed algorithms. 

Identifiability. Our core result is the following. 



Core result. Under the model assumptions in Section 2.2, one can identify the coefficient matrix
$A$ from the second-order moment $\mathbb{E}[xx^\top]$, without any assumption on the dependence relationships
among the hidden nodes.

This result shows how the graph expansion property enables the identifiability of connectivity 
structure between the set of hidden nodes and the set of observed nodes for a general DAG. It is 
worth noting that the result is obtained using only the second order moments. If the hidden nodes 
obey a Gaussian joint distribution, then so do the observed nodes and the second moment completely 
characterizes their joint distribution. But in general, the second moment provides a strictly smaller
amount of information than the entire joint distribution. This makes our result robust to noise
in the observations, as it relies on them only through the second moment.

We next consider two ensembles of DAG models, namely multi-level DAGs and DAGs with 
effective depth one. Building upon our core result, we show that for these ensembles the induced 
model among the hidden nodes is also identifiable. 

Multi-level DAGs. This ensemble contains graphs with a hierarchical structure. The nodes of a multi-level
DAG can be partitioned into levels $L_1, \dots, L_m$, such that there is no edge within a layer and
all the edges are between nodes in layer $L_i$ and the nodes in the adjacent layers $L_{i-1}$ and $L_{i+1}$ (see
Fig. 1(a) for an illustration). Assuming that the induced model between layers $L_i$ and $L_{i+1}$ obeys
the conditions in Section 2.2 for $i = 1, \dots, m - 1$, we show that the entire model can be learned in a
sequential manner.

DAGs with effective depth one. A DAG has effective depth one if every hidden node has at least one
observed neighbor (see Fig. 1(b) for an illustration). Now suppose that the dependence relationships
among the hidden nodes are also linear and are described as follows:

$$h_j = \sum_{\ell \in \mathrm{PA}_j} \lambda_{j\ell} h_\ell + \eta_j, \quad \text{for } j \in [k], \tag{6}$$

where $\{\eta_j\}_{j \in [k]}$ denote the noise terms. For models in this class, we use Excess Correlation Analysis
(ECA) [4] to learn the model from the third-order moment of the observed variables. Here, we
assume that the noise variables at the hidden nodes are non-Gaussian (e.g., they have non-zero third
moment or excess kurtosis).

(a) Full ternary tree (b) Caterpillar tree (c) Graph representation for topic models (hidden nodes: topics; observed nodes: word counts in a document)

Figure 2: Concrete examples of graphs from the ensembles of multi-level DAGs and DAGs with
effective depth one. Observed nodes and hidden nodes are respectively shown by shaded and white
circles. Using the results of this paper, these graphs are identifiable, under the linear dependence
model (Section 2.2), from second- and third-order moments of the observed variables.

Our presentation focuses on using exact (population) observed moments to emphasize the correctness
of the methodology. However, "plug-in" moment estimates can be used with sampled data.



To partially address the statistical efficiency of our method, note that higher-order empirical mo- 
ments generally have higher variance than lower-order empirical moments, and therefore are more 
difficult to reliably estimate. Our techniques only involve low-order moments (up to third order). 
A precise analysis of sample complexity involves standard techniques for dealing with sums of i.i.d. 
random matrices and tensors as in [4] and is left as a future study. 

Learning algorithm. The above results already imply identifiability of the aforementioned DAG 
models through exhaustive search. We also present some conditions on the coefficient matrix A, 
under which we can efficiently learn the columns of A from the second order moment, by solving a 
set of convex optimization problems. This leads to efficient algorithms for learning multi-level DAGs
and DAGs with effective depth one (Algorithm 1 and Algorithm 2).

Examples. It is useful to consider some concrete examples of multi-level DAGs and DAGs with
effective depth one, which satisfy the expansion property. Using the results of this paper, under the 
rank condition and the parameter genericity property for matrix A, these models are identifiable. 

Full d-regular trees. These are tree structures in which every node other than the leaves has $d$
children. These are included in the ensemble of multi-level DAGs, and it is immediate to see that
for $d \ge 3$ the model can be identified under the model described in Section 2.2 (note that $d \ge 2$
suffices for the expansion property but $d \ge 3$ is necessary for the rank condition). See Fig. 2(a) for an
illustration of a full ternary tree with latent variables.

Caterpillar trees. These are tree structures in which all the leaves are within distance one of a central
path. See Fig. 2(b) for an illustration. These structures have effective depth one. Let $d_{\max}$ and $d_{\min}$
respectively denote the maximum and the minimum number of leaves connected to a fixed node on
the central path. It is immediate to see that if $d_{\min} \ge d_{\max}/2 + 1$, the structure has the expansion
property.
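The caterpillar condition can be verified directly from Definition 2.3. The short derivation below is a sketch under the assumption that the nodes on the central path are the hidden nodes, the leaves are the observed nodes, and each leaf is attached to exactly one path node, so that the leaf sets of distinct path nodes are disjoint. The symbol $d_j$ (the number of leaves attached to path node $j$) is introduced here only for illustration.

```latex
\begin{align*}
|N(S)| = \sum_{j \in S} d_j &\ge |S|\, d_{\min}, \\
|S|\, d_{\min} \ge |S| + d_{\max}
  &\iff |S|\,(d_{\min} - 1) \ge d_{\max}.
\end{align*}
```

Since the left-hand side of the last inequality is nondecreasing in $|S|$ when $d_{\min} \ge 1$, it suffices to check the case $|S| = 2$, which gives $2(d_{\min} - 1) \ge d_{\max}$, i.e., $d_{\min} \ge d_{\max}/2 + 1$.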

Random bipartite graphs. Consider bipartite graphs with hidden nodes in one part and observed
nodes in the other part. Each edge (between the two parts) is included in the graph with probability
$\theta$, independently of every other edge. It is easy to see that, for any set $S \subseteq [k]$, the expected number
of its neighbors is $\mathbb{E}|N(S)| = n(1 - (1 - \theta)^{|S|})$. Also, the expected degree of the hidden nodes is $\theta n$.
Now, by applying a Chernoff bound, one can show that if $k < \theta n/2$, these graphs have the expansion
property with high probability, i.e., with probability converging to one as $n \to \infty$.
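The expression for the expected neighborhood size can be checked by simulation. The following Python sketch compares the closed form $n(1 - (1 - \theta)^{|S|})$ with a Monte Carlo estimate; the function names, the parameter values, and the fixed seed are hypothetical choices for this illustration, and the simulation says nothing about the high-probability expansion claim itself.

```python
import random


def expected_neighbors(n, theta, s):
    """Closed-form E|N(S)| for a set S of s hidden nodes and n observed nodes."""
    return n * (1.0 - (1.0 - theta) ** s)


def simulate_neighbors(n, theta, s, trials, seed=0):
    """Monte Carlo estimate of E|N(S)| with every edge drawn independently."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        for _ in range(n):
            # An observed node belongs to N(S) if at least one of the s
            # candidate edges from S into it is present.
            if any(rng.random() < theta for _ in range(s)):
                total += 1
    return total / trials


n, theta, s = 200, 0.1, 3
print(expected_neighbors(n, theta, s))
print(simulate_neighbors(n, theta, s, trials=2000))
```

With these values the closed form is $200 \cdot (1 - 0.9^3) = 54.2$, and the simulated average should land close to it.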

Application to correlated topic models. An important application of the results of this paper 
is in estimating topic models with correlated topics. Topic models are a popular family of mixture 
models that incorporate latent variables, the topics, to explain the observed co-occurrences of words 
in documents. Each document has a mixture of active topics and each active topic determines the 
occurrence of words in the document. A topic model can be viewed as a bipartite DAG with topics in
one part and the observed nodes in the other part. See Fig. 2(c) for an illustration. (As an example,
one may think of the $i$-th observed variable as the word counts in the $i$-th sentence of a document.)
Using this representation, estimating the topics from the document is exactly the learning problem
of the corresponding DAG. Existing work on estimating topic models provides results for certain
distributions over the topics. For instance, in independent component analysis (ICA), the topics are
assumed to be independent, while in latent Dirichlet allocation (LDA), a Dirichlet prior is assigned
to the distribution of topics in documents. However, it has been observed empirically that correlated
topic models provide a better fit for document modeling [10, 34]. A popular correlated topic model,
termed Pachinko allocation, involves multi-level DAGs for modeling word dependencies. We can
now efficiently learn a rich class of similar correlated topic models.

It is convenient to discuss a concrete example which further showcases the applicability of our 
results for topic models. Consider the linear model as described by Eqs. (2) and (6) and suppose that the
noise variables are independent Poisson random variables and all hidden and observed variables are
Poisson. Note that the sum of independent Poisson random variables is also Poisson, and therefore this
is a valid model. This scenario is readily applicable for topic modeling since we can interpret each 
observed Poisson variable as specifying the count of a certain word, and each hidden Poisson variable 
as giving the count of a certain topic, and there can be arbitrary dependencies among the hidden 
topics. Prior to this work, even basic parameter and structural identifiability of such correlated topic 
models was not known. Our work gives, for the first time, a computationally efficient estimator that 
relies on estimation using only low-order moments. 

2.4 Our techniques 

Our proof techniques rely on ideas and tools developed in dictionary learning, matrix decomposition,
and method of moments. We briefly explain our techniques and their relations to these areas.

Matrix decomposition into diagonal and low-rank parts. To prove our core result, we first
observe that under the linear model, $\mathbb{E}[xx^\top]$ is the sum of a low-rank matrix and a diagonal one:

$$\mathbb{E}[xx^\top] = A\,\mathbb{E}[hh^\top]A^\top + \mathbb{E}[\varepsilon\varepsilon^\top].$$

We prove that under the rank condition (Condition 2), $\mathbb{E}[xx^\top]$ can be decomposed into its
low-rank component $A\,\mathbb{E}[hh^\top]A^\top$ and its diagonal component $\mathbb{E}[\varepsilon\varepsilon^\top]$. This means that we can remove
the noise contribution from the second-order moment. Moreover, $\mathrm{rank}(A\,\mathbb{E}[hh^\top]A^\top) = k$ gives the
number of hidden nodes. We propose a simple algorithm (Subroutine) for this decomposition.
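The sum structure of the second-order moment can be seen by writing the linear model (2) in vector form, $x = Ah + \varepsilon$. The expansion below is a sketch that additionally assumes the noise $\varepsilon$ is uncorrelated with the hidden variables $h$, i.e., $\mathbb{E}[h\varepsilon^\top] = 0$, which is how we read the model assumptions here. The diagonal form of $\mathbb{E}[\varepsilon\varepsilon^\top]$ follows from the noise terms being zero mean and uncorrelated with one another.

```latex
\begin{align*}
\mathbb{E}[xx^\top]
  &= \mathbb{E}\big[(Ah + \varepsilon)(Ah + \varepsilon)^\top\big] \\
  &= A\,\mathbb{E}[hh^\top]A^\top + A\,\mathbb{E}[h\varepsilon^\top]
     + \mathbb{E}[\varepsilon h^\top]A^\top + \mathbb{E}[\varepsilon\varepsilon^\top] \\
  &= A\,\mathbb{E}[hh^\top]A^\top
     + \operatorname{diag}\big(\sigma_{\varepsilon_1}^2, \dots, \sigma_{\varepsilon_n}^2\big).
\end{align*}
```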

It should be noted that additive matrix decompositions into low-rank and diagonal (or sparse)
terms have been considered in previous work [16, 27, 43]. Using the techniques of [43], we can relax
Condition 2 to $k < n/2$, but only by imposing additional strong incoherence conditions on the
low-rank component.

Dictionary learning. We proceed by showing that using the graph expansion property (Condition 3),
one can recover $A$ from the low-rank part $A\,\mathbb{E}[hh^\top]A^\top$, obtained from the decomposition
of the observed covariance matrix, as described above. To prove this claim, we leverage the ideas
developed in [46] for the dictionary learning problem. In [46], the authors consider the problem of
learning sparsely used dictionaries with an invertible dictionary and a random, sparse coefficient
matrix (Bernoulli-Gaussian and Bernoulli-Rademacher models). They establish that the dictionary and
the coefficient matrix can be learned from exact measurements. The gist of the idea is that under
the above conditions, the row space of the coefficient matrix is the same as that of the measurement
matrix. The rows of the coefficient matrix are then the sparsest vectors in the corresponding space.

Notice that here we are in the same situation. Since $\mathbb{E}[hh^\top]$ and $A$ have full column rank, we
have $\mathrm{Col}(A) = \mathrm{Col}(A\,\mathbb{E}[hh^\top]A^\top)$. However, in contrast to the setting in [46], the coefficient matrix
$A$ is not generated from a probabilistic model. We introduce the graph expansion property as the
underlying notion which makes the recovery of $A$ possible. In fact, it can be shown that the
probabilistic models considered in [46] possess this property almost surely. Our core result (identifiability of
$A$) is established by showing that, under the expansion property for the model, the columns of $A$ are
the sparsest vectors in $\mathrm{Col}(A\,\mathbb{E}[hh^\top]A^\top)$.

Method of moments. For DAGs with effective depth one, observe that the hidden variables are
related to each other and to the noise terms $\{\eta_j\}_{j \in [k]}$ via the linear equations (6). Define $\Lambda \in \mathbb{R}^{k \times k}$ by
letting the $(i, j)$ entry be $\lambda_{ij}$ if $j \in \mathrm{PA}_i$ and zero otherwise. Solving for the hidden variables $h_j$, we
have $h = (I - \Lambda)^{-1}\eta$, with $\eta := (\eta_1, \dots, \eta_k)$. The observed variables are also related to the hidden
ones via the coefficient matrix $A$. The idea is to consider an equivalent DAG model obtained by
suppressing the hidden nodes $h_j$ and treating the noise terms $\eta_j$ as the new uncorrelated topics.
The observed variables $x_i$ are then related to the new topics through the matrix $A(I - \Lambda)^{-1}$. Next,
we apply the ECA method [4] to learn $A(I - \Lambda)^{-1}$ from the second- and third-order moments of the
observed variables. ECA is based on two singular value decompositions: the first SVD whitens the
data (using the second moment) and the second SVD uses the third moment to find directions which
exhibit information that is not captured by the second moment. Finally, in order to identify the
dependence structure among the hidden nodes (matrix $\Lambda$), we use the expansion property to extract
$A$ and $\Lambda$ from $A(I - \Lambda)^{-1}$. The high-level idea is depicted in Fig. 3.
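The reduction to an equivalent model with uncorrelated topics can be illustrated numerically. The Python sketch below only demonstrates the algebraic step $h = (I - \Lambda)^{-1}\eta$ and the resulting map $A(I - \Lambda)^{-1}$ from the noise terms to the noiseless part of the observations. It does not implement ECA, and the dimensions, seed, edge weights in `Lam`, and the shifted-exponential noise are hypothetical choices for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 3

# Hypothetical coefficient matrix from hidden to observed nodes.
A = rng.normal(size=(n, k))

# Hypothetical dependence among hidden nodes: Lam[i, j] != 0 iff j is a parent of i.
# Here h_2 depends on h_1 and h_3 depends on h_2 (a chain, hence acyclic).
Lam = np.zeros((k, k))
Lam[1, 0] = 0.5
Lam[2, 1] = -0.8

# Zero-mean, non-Gaussian noise at the hidden nodes (shifted exponential).
eta = rng.exponential(size=k) - 1.0

# Solve the structural equations h = Lam h + eta, i.e. h = (I - Lam)^{-1} eta.
h = np.linalg.solve(np.eye(k) - Lam, eta)

# Equivalent model: the noise terms eta act as topics with matrix A (I - Lam)^{-1}.
M = A @ np.linalg.inv(np.eye(k) - Lam)

print(np.allclose(h, Lam @ h + eta))
print(np.allclose(A @ h, M @ eta))
```

Because `Lam` encodes an acyclic graph, it is nilpotent, so `I - Lam` is always invertible in this construction.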

Figure 3: The high-level idea of the technique used for learning DAGs with effective depth one. In
the leftmost graph (original DAG) the hidden nodes depend on each other through the matrix $\Lambda$ and
the observed variables depend on the hidden nodes through the coefficient matrix $A$. We consider
an equivalent DAG with new uncorrelated topics $\eta_j$ (these are in fact the noise terms at the hidden
nodes). Here, the observed variables depend on the hidden ones through the matrix $A(I - \Lambda)^{-1}$.
Applying the ECA method, we learn this matrix from the (second- and third-order) observed moments.
Finally, using the expansion property of the connectivity structure between the hidden part and the
observed part, we extract $A$ and $\Lambda$ from $A(I - \Lambda)^{-1}$.

2.5 Related work

The problem of identifiability and learning graphical models from distributions has been the object of
intensive investigation in the past years and has been studied in different research communities. This
problem has proved important in a vast number of applications, such as computational biology [22, 42],
economics [7, 12, 24, 52], sociology [9, 12, 30, 51], and computer vision [19, 32]. The learning task has
two main ingredients: structure learning and parameter estimation.

Structure estimation has been extensively studied in recent years. It is well known that
maximum likelihood estimation in fully observed tree models is tractable [20]. However, for general
models, structure learning is NP-hard even when there are no hidden variables. The main approaches
for structure estimation are score-based methods, local tests and convex relaxation methods.
Score-based methods such as [17] find the graph structure by optimizing a score, like the Bayesian information
criterion (BIC), in a greedy manner. Local test approaches attempt to build the graph based on local

In the presence of latent variables, structure learning becomes more challenging. A popular class
of latent variable models is latent trees, for which efficient algorithms have been developed [3, 18, 21,
23]. Recently, approaches have been proposed for learning (undirected) latent graphical models with
long cycles in certain parameter regimes [6]. In [15], latent Gaussian graphical models are estimated
using convex relaxation approaches. The authors in [45] study linear latent DAG models and propose
methods to (1) find clusters of observed nodes that are separated by a single latent common cause;
and (2) find features of the Markov equivalence class of causal models for the latent variables. Their
model allows for undirected edges between the observed nodes. In [2], the equivalence class of DAG
models is characterized when there are latent variables. However, the focus there is on constructing an
equivalence class of DAG models, given a member of the class. In contrast, we focus on developing
efficient learning methods for latent DAGs.

For parameter estimation with hidden variable models, the traditional approach is the expectation
maximization (EM) algorithm, which finds a local maximizer of the likelihood. Unfortunately,
optimality and recovery guarantees are generally lacking for EM, even when the model is correct.
Another approach is to constrain the dependency structure among the hidden nodes. For instance,
in independent component analysis (ICA) [28], it is assumed that the latent variables obey a product
distribution and hence in the corresponding graph model there is no edge between the latent variables
(there are only directed edges from latent nodes to the observed nodes). Several generalizations of
ICA have also been developed where clusters of dependent components are learned, e.g., [8] considers
tree component analysis where the latent nodes form a tree model; [49] considers independent
subspace analysis; and [4] considers latent variables to be drawn from latent Dirichlet allocation (LDA),
relevant in topic modeling [11]. These approaches are based on the method of moments, where the
observed moments are matched to those specified by the model. In this paper, we use ideas from the
method of moments to establish identifiability and efficient recovery for DAG models.



3 Main results 

In this section, we state our identifiability results and algorithms for learning the DAG models with 
latent variables. 

3.1 Learning the coefficient matrix A 

Theorem 3.1. Let $\Sigma := \mathbb{E}[xx^\top]$ be the second-order moment of the observed variables. For the model
described in Section 2.2 (Conditions 1, 2, 3 and 4), all columns of $A$ are identifiable from $\Sigma$.



Theorem 3.1 is proved in Section 4.1. As shown in the proof, columns of $A$ are in fact the
sparsest vectors in the space $\mathrm{Col}(A\,\mathbb{E}[hh^\top]A^\top)$. This result already implies identifiability of $A$ via an
exhaustive search, which is an interesting result in its own right. The following theorem provides some
conditions under which the columns of $A$ can be identified by solving a set of convex optimization
problems. Before stating the theorem, we need to establish some notation.

For $i \in [n]$, we define $N_i := \{j \in [k] : A_{ij} \ne 0\}$ and $N_i^2 := \{l \in [n] : A_{lj} \ne 0 \text{ for some } j \in N_i\}$.
Similarly, for $j \in [k]$, define $N_j := \{i \in [n] : A_{ij} \ne 0\}$ and $N_j^2 := \{l \in [k] : A_{il} \ne 0 \text{ for some } i \in N_j\}$.
We use superscript $c$ to denote the set complement.

Theorem 3.2. Suppose that in each row of $A$, there is a gap between the maximum and the second
maximum absolute values. For $i \in [n]$, let $\pi_i$ be a permutation such that $|a_{i,\pi_i(1)}| \ge |a_{i,\pi_i(2)}| \ge
\cdots \ge |a_{i,\pi_i(k)}|$, and $|a_{i,\pi_i(2)}| / |a_{i,\pi_i(1)}| \le 1 - \gamma_i$, for some $\gamma_i > 0$. Further suppose that $[k] \subseteq
\{\pi_1(1), \dots, \pi_n(1)\}$. In words, each column contains at least one entry that has the maximum
absolute value in its row. If the following conditions hold true for $i \in [n]$, then Algorithm 1 returns
the rows of $A$ in canonical form.

(i) $\|A_{(N_i^2)^c,\,(N_i)^c}\, v\|_1 > \|A_{N_i^2,\,(N_i)^c}\, v\|_1$ for all non-zero vectors $v \in \mathbb{R}^{|(N_i)^c|}$.

(ii) $\|A_{(N_j)^c,\,N_i \setminus j}\, v\|_1 > \|A_{N_j,\,N_i \setminus j}\, v\|_1 + (1 - \gamma_i)\,\|A_{N_j,\,j}\|_1 \|v\|_1$ for all $j \in N_i$ and all non-zero vectors $v \in \mathbb{R}^{|N_i| - 1}$.



Theorem 3.2 is proved in Section 4.2. Algorithm 1 is essentially ER-SpUD, presented in [46] for
exact recovery of sparsely-used dictionaries, but the technical result and application are novel.



According to Theorem 3.1, we can learn the coefficient matrix $A$ of the model without any
assumption on the dependence relationships among the hidden nodes. (We only need the
non-degeneracy assumption discussed in Condition 1, which requires that the hidden variables be linearly
independent with probability one.)

Subroutine: Decomposition of a matrix into its low-rank and diagonal parts.

Input: Matrix $C = AB^\top + D$, with $A, B \in \mathbb{R}^{n \times k}$, $D \in \mathbb{R}^{n \times n}$ diagonal, and partition $\mathcal{P}$ of $[n]$.
Output: Diagonal part $D$ and low-rank part $L = AB^\top$.

1: for each $I \in \mathcal{P}$ do
2:    Choose distinct $J, K \in \mathcal{P} \setminus \{I\}$.
3:    Let $U_I \in \mathbb{R}^{|I| \times k}$ be the matrix of left singular vectors of $C_{I,J}$.
4:    Let $V_J \in \mathbb{R}^{|J| \times k}$ be the matrix of right singular vectors of $C_{I,J}$.
5:    Let $U_K \in \mathbb{R}^{|K| \times k}$ be the matrix of left singular vectors of $C_{K,J}$.
6:    Set $A_I B_I^\top = C_{I,J} V_J (U_K^\top C_{K,J} V_J)^{-1} U_K^\top C_{K,I}$.
7:    Set $D_{I,I} = C_{I,I} - A_I B_I^\top$.
8: return $D$ and $L = C - D$.
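The Subroutine can be sketched in a few lines of NumPy. The following is a minimal illustration on exact (population) moments rather than a definitive implementation: the function name, the explicit rank argument `k`, and the synthetic test matrices are assumptions made here, and the sketch skips step 3 because $U_I$ does not appear in the formula of step 6.

```python
import numpy as np


def decompose_lowrank_diagonal(C, partition, k):
    """Return (D, L) with C = L + D, where L has rank k and D is diagonal.

    `partition` is a list of three disjoint index lists covering range(n).
    """
    n = C.shape[0]
    D = np.zeros((n, n))
    for pos, I in enumerate(partition):
        # Step 2: the two remaining blocks play the roles of J and K.
        J, K = [block for m, block in enumerate(partition) if m != pos]
        C_IJ = C[np.ix_(I, J)]
        C_KJ = C[np.ix_(K, J)]
        C_KI = C[np.ix_(K, I)]
        # Steps 4-5: top-k right singular vectors of C_IJ and
        # top-k left singular vectors of C_KJ.
        V_J = np.linalg.svd(C_IJ)[2][:k].T
        U_K = np.linalg.svd(C_KJ)[0][:, :k]
        # Step 6: recover the diagonal block A_I B_I^T of the low-rank part
        # from off-diagonal blocks, which the diagonal matrix D does not touch.
        core = np.linalg.inv(U_K.T @ C_KJ @ V_J)
        L_II = C_IJ @ V_J @ core @ U_K.T @ C_KI
        # Step 7: the remainder of C_II is the diagonal part; keeping only its
        # diagonal discards floating-point residue off the diagonal.
        D[np.ix_(I, I)] = np.diag(np.diag(C[np.ix_(I, I)] - L_II))
    return D, C - D


# Synthetic example (all values hypothetical).
rng = np.random.default_rng(0)
n, k = 9, 2
A = rng.normal(size=(n, k))
cov_h = np.array([[1.0, 0.3], [0.3, 1.0]])
D_true = np.diag(rng.uniform(0.5, 1.5, size=n))
C = A @ cov_h @ A.T + D_true

partition = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
D, L = decompose_lowrank_diagonal(C, partition, k)
print(np.allclose(D, D_true), np.allclose(L, A @ cov_h @ A.T))
```

With empirical moments in place of exact ones, the inverse in step 6 can be poorly conditioned, so a practical version would likely need regularization or a pseudo-inverse; that is outside the scope of this sketch.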



Algorithm 1: Recovering columns of coefficient matrix $A$ from the second-order moment $\Sigma$.

Input: Second-order moment of the observed variables $\Sigma$.
Output: Columns of $A$ up to permutation.

1: Find a partition $\mathcal{P}$ of $[n]$ such that $|\mathcal{P}| = 3$ and $\mathrm{rank}(\Sigma_{I,J}) = k$ for distinct $I, J \in \mathcal{P}$.
2: Let $L$ be the low-rank part returned by Subroutine($\Sigma$, $\mathcal{P}$).
3: for each $i \in [n]$ do
4:    Solve the optimization problem
         $\min_w \|L^{1/2} w\|_1$ subject to $(e_i^\top L^{1/2}) w = 1$.
5:    Set $s_i = L^{1/2} w$, and let $S = \{s_1, \dots, s_n\}$.
6: for each $j = 1, \dots, k$ do
7:    repeat
8:       Let $v_j$ be an arbitrary element in $S$.
9:       Set $S = S \setminus \{v_j\}$.
10:   until $\mathrm{rank}([v_1 | \cdots | v_j]) = j$
11: Set $\hat{A} = [v_1 | \cdots | v_k]$.
12: Let $B$ be a left inverse for $\hat{A}$, i.e., $B\hat{A} = I_{k \times k}$.
13: return Columns of $\hat{A}\,(\mathrm{diag}(B L B^\top))^{1/2}$.
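The final rescaling in steps 11 to 13 of Algorithm 1 can be illustrated in isolation. The Python sketch below assumes that the candidate matrix `A_hat` already contains the columns of $A$ up to arbitrary non-zero scalings (as the earlier steps are meant to provide) and shows that the rescaling returns the canonical form up to sign. The function name, the use of the pseudo-inverse as the left inverse $B$, and the synthetic data are assumptions made for this example; the $\ell_1$ minimization of step 4 is not implemented here.

```python
import numpy as np


def to_canonical(A_hat, L):
    """Rescale the columns of A_hat using the low-rank part L (steps 12-13)."""
    # Step 12: for a full-column-rank A_hat, the pseudo-inverse is a left inverse.
    B = np.linalg.pinv(A_hat)
    # Step 13: multiply column j by the square root of the j-th diagonal
    # entry of B L B^T.
    scale = np.sqrt(np.diag(B @ L @ B.T))
    return A_hat * scale


# Synthetic example (all values hypothetical).
rng = np.random.default_rng(2)
n, k = 6, 3
A = rng.normal(size=(n, k))
G = rng.normal(size=(k, k))
cov_h = G @ G.T + np.eye(k)          # a positive definite stand-in for E[h h^T]
L = A @ cov_h @ A.T                  # low-rank part of the second-order moment

# Columns of A with arbitrary non-zero scalings (including a sign flip).
A_hat = A * np.array([2.0, -0.5, 3.0])

A_canon = to_canonical(A_hat, L)
sigma_h = np.sqrt(np.diag(cov_h))

# Up to the sign of each column, the result is A diag(sigma_h), the canonical form.
print(np.allclose(np.abs(A_canon), np.abs(A * sigma_h)))
```

The comparison uses absolute values because, as noted after Definition 2.2, the canonical form is only determined up to the sign of each column.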
Note that the coefficient matrix $A$ does not completely specify the distribution, as the $h_i$'s are
not necessarily statistically independent, and we can hope to learn the correlation structure among
the $h_i$'s. We next consider two families of DAG models, namely multi-level DAGs and DAGs with
effective depth one. For these families, we proceed further and prove identifiability of the entire
model.



3.2 Multi-level DAGs 

Definition 3.3. A multi-level DAG model is a model with the following graph structure. The nodes
of the graph can be partitioned into levels $L_1, \dots, L_m$ such that there is no edge between the nodes
within one layer and all the edges are between nodes in adjacent layers, $(L_i, L_{i+1})$ for $i \in [m-1]$.
Furthermore, the edges are directed from $L_i$ to $L_{i+1}$. The nodes in layer $L_m$ correspond to the
observed nodes and the other layers contain the hidden nodes.

The next theorem concerns identifiability of linear multi-level DAGs. More specifically, consider a
multi-level DAG model and let $G_i$ be the induced graph with nodes $L_i \cup L_{i+1}$, and suppose that the
induced model between levels $L_i$ and $L_{i+1}$ satisfies the model conditions described in Section 2.2
with coefficient matrix $A_i$, for $i \in [m-1]$: $A_i$ has the rank condition (Condition 2) and the parameter
genericity property (Condition 4), and the (bipartite) graph $G_i$ has the expansion property (Condition 3).

Theorem 3.4. Consider a multi-level DAG with levels $L_1, \dots, L_m$ and suppose that the induced
model between layers $L_i$ and $L_{i+1}$ satisfies the model conditions described in Section 2.2 with
coefficient matrix $A_i$, for $i \in [m-1]$. Then all columns of $A_i$ are identifiable for $i \in [m-1]$ from the
second order moment of the observed variables $\Sigma$. Therefore, the entire DAG is identifiable up to
permuting the nodes within each level.



Theorem 3.4 is proved in Section 4.3.



Remark 3.5. By definition of a multi-level DAG, the hidden nodes in level $L_1$ are independent. Now
consider the case that the nodes in $L_1$ have arbitrary dependence relationships. By using the same
argument as in the proof of Theorem 3.4, we can still learn all the coefficient matrices $A_i$ and the
second order moment of the nodes in $L_1$.



3.3 DAGs with effective depth one 

Definition 3.6. The effective depth of a DAG model with hidden nodes is the maximum graph 
distance between a hidden node and its closest observed node. 

In particular, in a DAG with effective depth one every hidden node has at least one observed 
neighbor. Recall that the observed and the hidden nodes obey the linear model 

$$x_i = \sum_{j \in \mathrm{PA}_i} a_{ij} h_j + \varepsilon_i, \quad \text{for } i \in [n]. \qquad (7)$$

Assume further that the hidden variables obey the linear model

$$h_j = \sum_{\ell \in \mathrm{PA}_j} \lambda_{j\ell} h_\ell + \eta_j, \quad \text{for } j \in [k]. \qquad (8)$$

Let $\Lambda \in \mathbb{R}^{k \times k}$ be the matrix with $\lambda_{ij}$ at the $(i,j)$ entry if $j \in \mathrm{PA}_i$ and zero everywhere else.



As described in Section 2.2, without loss of generality, we assume that the hidden variables $h_j$, the
observed variables $x_i$ and the noise terms $\varepsilon_i, \eta_j$ are all zero mean. We also denote the variances of $\varepsilon_i$
and $\eta_j$ by $\sigma^2_{\varepsilon_i}$ and $\sigma^2_{\eta_j}$, respectively. Let $\mu_{\varepsilon_i}$ and $\mu_{\eta_j}$ respectively denote the third moment of $\varepsilon_i$ and
$\eta_j$, i.e., $\mu_{\varepsilon_i} := E[\varepsilon_i^3]$ and $\mu_{\eta_j} := E[\eta_j^3]$. Define the skewness of $\eta_j$ as:

$$\gamma_{\eta_j} := \frac{\mu_{\eta_j}}{\sigma^3_{\eta_j}}. \qquad (9)$$

Finally, denote the second and third order correlations of the observed variables as: 

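$$\Sigma := E[x x^\top], \qquad \Psi := E[x \otimes x \otimes x], \qquad (10)$$

where $\otimes$ denotes the tensor product. It is convenient to consider the projection of $\Psi$ to a matrix as
follows:

$$\Psi(\zeta) := E[x x^\top \langle \zeta, x \rangle],$$

where $\langle \cdot, \cdot \rangle$ denotes the standard inner product.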
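Stacking (7) and (8) in matrix form gives $x = Ah + \varepsilon$ and $h = \Lambda h + \eta$, so $h = (I - \Lambda)^{-1}\eta$. The following sketch computes the population second moment $\Sigma$ implied by these equations, assuming (as in the effective-depth-one analysis) that the noise coordinates are mutually uncorrelated. The function and argument names are illustrative only.

```python
import numpy as np

def population_second_moment(A, Lam, var_eta, var_eps):
    """Sigma = M diag(var_eta) M^T + diag(var_eps), with M = A (I - Lam)^{-1}.

    A       : (n, k) observed-from-hidden coefficients, as in Eq. (7)
    Lam     : (k, k) hidden-from-hidden coefficients, as in Eq. (8)
    var_eta : length-k variances of the hidden noise terms
    var_eps : length-n variances of the observed noise terms
    """
    k = Lam.shape[0]
    M = A @ np.linalg.inv(np.eye(k) - Lam)
    return M @ np.diag(var_eta) @ M.T + np.diag(var_eps)
```

For instance, with two hidden nodes, a single edge of weight 0.5 from the first to the second, unit noise variances and $A = I$, the hidden covariance is `[[1, 0.5], [0.5, 1.25]]`.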

Theorem 3.7. Consider a DAG model with effective depth one, which satisfies the model conditions
described in Section 2.2 and in which the hidden variables are related through the linear equations (8). If the noise
variables $\eta_j$ have non-zero skewness for $j \in [k]$, then the DAG model is identifiable from $\Sigma$ and $\Psi(\zeta)$,
for an appropriate choice of $\zeta$. Furthermore, under the assumptions of Theorem 3.2, Algorithm 2
returns matrices $A$ and $\Lambda$ up to a permutation of hidden nodes.



Theorem 3.7 is proved in Section 4.4. In Theorem 3.7 we prove identifiability of DAGs with
effective depth one from the second and third order moments. A natural question is what can be
done if only the second order moment is provided. The following remark states that if an oracle gives
a topological ordering of the DAG structure, then the model can be learned from the second
order moment alone, and there is no need for the third order moment.

Remark 3.8. A topological ordering of a DAG is a labeling of the nodes such that, for every directed
edge $(j, i)$, we have $j < i$. It is a well-known result in graph theory that a directed graph is a DAG
if and only if it admits a topological ordering. Now, consider a DAG model with effective depth
one and suppose that an oracle provides us with a topological ordering of the induced DAG on the
hidden nodes, i.e., for any labeling of the hidden nodes the oracle returns a permutation of the labels
which is faithful to a topological ordering of the DAG. Then, the DAG model (matrices $A$ and $\Lambda$) is
identifiable from only the second order moment $\Sigma$.

Remark 3.8 is proved in Appendix D.

Remark 3.9 (Learning fully-observed DAGs). An interesting and immediate application of the
technique used in the proof of Theorem 3.7 is in learning fully-observed DAGs. Consider an arbitrary
fully-observed linear DAG:

$$x_i = \sum_{j \in \mathrm{PA}_i} \lambda_{ij} x_j + \eta_i, \quad \text{for } i \in [n], \qquad (11)$$

and suppose that the noise variables $\eta_i$ have non-zero skewness. Then, applying the same argument
as in the proof of Theorem 3.7, we can learn the matrix $(I - \Lambda)$ (and hence $\Lambda$) from the second
and third order moments.
Algorithm 2: Learning DAGs with effective depth one.

Input: Vector $\theta \in \mathbb{R}^k$, observable moments $\Sigma$ and $\Psi$ as defined in Eq. (10).
Output: Columns of $A$, matrix $\Lambda$ (in a topological ordering).

Part 1: Decomposition of $\Sigma$.
  Find a partition $\mathcal{P}$ of $[n]$ such that $|\mathcal{P}| = 3$ and $\mathrm{rank}(\Sigma_{I,J}) = k$ for distinct $I, J \in \mathcal{P}$.
  Let $L_\Sigma$ and $D_\Sigma$ be the low-rank and the diagonal parts returned by Subroutine($\Sigma$, $\mathcal{P}$).
Part 2: ECA.
  Find a matrix $U \in \mathbb{R}^{n \times k}$ such that $\mathrm{Col}(U) = \mathrm{Col}(L_\Sigma)$.
  Find $V \in \mathbb{R}^{k \times k}$ such that $V^\top (U^\top L_\Sigma U) V = I_{k \times k}$. Set $W = UV$.
  Find a partition $\mathcal{P}$ of $[n]$ such that $|\mathcal{P}| = 3$ and $\mathrm{rank}(\Psi(W\theta)_{I,J}) = k$ for distinct $I, J \in \mathcal{P}$.
  Let $L_\Psi$ and $D_\Psi$ be the low-rank and the diagonal parts returned by Subroutine($\Psi(W\theta)$, $\mathcal{P}$).
  Let $\Omega$ be the set of (left) singular vectors, with unique singular values, of $W^\top L_\Psi W$.
  Let $S \in \mathbb{R}^{n \times k}$ be a matrix with columns $\{(W^+)^\top \omega : \omega \in \Omega\}$, where $W^+ = (W^\top W)^{-1} W^\top$.
Part 3: Finding $A$ and $\Lambda$.
  Let $\hat{A}$ = Algorithm 1($\Sigma$).
  Let $\hat{B}$ be a left inverse of $\hat{A}$. Let $C = \hat{B} S$.
  Reorder the rows and columns of $C$ to make it lower triangular. Call it $\tilde{C}$.
  return Columns of $\hat{A}$ and $\Lambda = I - \mathrm{diag}(\tilde{C})\, \tilde{C}^{-1}$.



3.4 Remark on finding the partition V 

The rank condition for matrix $A$ ensures the existence of a partition $\mathcal{P}$ of $[n]$ such that $|\mathcal{P}| = 3$ and
$A_I \in \mathbb{R}^{|I| \times k}$ has full column rank for all $I \in \mathcal{P}$. However, we are not provided with such a partition,
and therefore in Algorithm 1 and Algorithm 2 we need to search for $\mathcal{P}$. The complexity order of
this searching step is $n^k$. Here, we show that under an incoherence assumption about $A$, a random
partitioning of its rows into three groups has the desired property, with fixed positive probability.

Definition 3.10. Let $A = USV^\top$ be a thin singular value decomposition of $A$, where $U \in \mathbb{R}^{n \times k}$ has
orthonormal columns, $S = \mathrm{diag}(\sigma_1(A), \dots, \sigma_k(A))$, and $V \in \mathbb{R}^{k \times k}$ is orthogonal. Define the
incoherence number of $A$ as:

$$c_A := \max_{j \in [n]} \left\{ \frac{n}{k} \|U^\top e_j\|^2 \right\}. \qquad (12)$$



Lemma 3.11. Fix $\ell \in [n]$, and consider $\ell$ random submatrices $A_1, A_2, \dots, A_\ell$ of $A$ obtained by the
following process: for each row of $A$, independently choose one of the $\ell$ submatrices uniformly at
random, and put the row in that submatrix. Fix $\delta \in (0, 1)$. Then,

$$P\{\sigma_k(A_v) \ge \sigma_k(A)/(2\sqrt{\ell}),\ \forall v \in [\ell]\} \ge 1 - \delta, \qquad (13)$$

provided that $c_A \le \dfrac{9}{32} \cdot \dfrac{n}{k \ell \ln(k\ell/\delta)}$.



Lemma 3.11 is proved in Appendix E. Using this lemma with $\ell = 3$, we obtain the following. For
$A \in \mathbb{R}^{n \times k}$ with full column rank and a random partitioning $\mathcal{P}$ of its rows into three groups, all the
submatrices $A_I$, $I \in \mathcal{P}$, are full rank with probability at least $1 - \delta$, provided that

$$c_A \le \frac{3}{32} \cdot \frac{n}{k \ln(3k/\delta)}. \qquad (14)$$
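The random-partition heuristic can be sketched as follows: compute the incoherence number of Definition 3.10, draw a uniformly random assignment of rows to three groups as in Lemma 3.11, and check whether every group has full column rank. The helper names (`incoherence`, `random_partition`, `all_groups_full_rank`) and the numerical tolerance are assumptions made for this illustration.

```python
import numpy as np

def incoherence(A):
    """c_A = max_j (n/k) * ||U^T e_j||^2, with U from a thin SVD of A (Eq. (12))."""
    n, k = A.shape
    U = np.linalg.svd(A, full_matrices=False)[0]   # n x k, orthonormal columns
    return (n / k) * np.max(np.sum(U ** 2, axis=1))

def random_partition(n, groups=3, seed=None):
    """Assign each row index independently and uniformly to one of `groups` sets."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, groups, size=n)
    return [np.flatnonzero(labels == g).tolist() for g in range(groups)]

def all_groups_full_rank(A, partition, tol=1e-10):
    """True if every row-submatrix A_I has k-th singular value above `tol`."""
    k = A.shape[1]
    for I in partition:
        if len(I) < k:
            return False
        if np.linalg.svd(A[I], compute_uv=False)[-1] <= tol:
            return False
    return True
```

In practice one would redraw the partition until `all_groups_full_rank` succeeds; by Lemma 3.11 each draw succeeds with probability at least $1 - \delta$ when (14) holds.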

4 Proof of the theorems 

4.1 Proof of Theorem 3.1

Observe that

$$\Sigma = E[x x^\top] = E[(Ah + \varepsilon)(Ah + \varepsilon)^\top] = A\,E[h h^\top]\,A^\top + E[\varepsilon \varepsilon^\top]. \qquad (15)$$

Since the hidden variables are linearly independent, $E[h h^\top]$ is full rank. Otherwise, $v^\top E[h h^\top] v = 0$
for some non-zero vector $v$. This implies that $E[\|h^\top v\|^2] = 0$ and so $h^\top v = 0$, which leads to a
contradiction. Find a partition $\mathcal{P}$ of $[n]$ such that $|\mathcal{P}| = 3$ and $\mathrm{rank}(\Sigma_{I,J}) = k$ for all distinct
$I, J \in \mathcal{P}$. (Note that $\mathrm{rank}(\Sigma_{I,J}) = \mathrm{rank}(A_I E[h h^\top] A_J^\top)$ and, by the rank condition, there exists such a
partition $\mathcal{P}$.) We first show that Subroutine($\Sigma$, $\mathcal{P}$) returns $A\,E[h h^\top]\,A^\top$ and the diagonal matrix
$E[\varepsilon \varepsilon^\top]$.

Lemma 4.1. Let $C = AB^\top + D$, with $A, B \in \mathbb{R}^{n \times k}$ and $D \in \mathbb{R}^{n \times n}$ a diagonal matrix. Suppose that
for a fixed partition $\mathcal{P}$ of $[n]$, with $|\mathcal{P}| = 3$, all the submatrices $A_I$ and $B_I$ have full column rank $k$,
for all $I \in \mathcal{P}$. Then, Subroutine($C$, $\mathcal{P}$) returns $AB^\top$ and $D$.

The proof of Lemma 4.1 is deferred to Appendix B.

Given that $E[h h^\top]$ and $A$ have full column rank, we have $\mathrm{Col}(A) = \mathrm{Col}(A\,E[h h^\top]\,A^\top)$. Let
$\{u_1, \dots, u_k\}$ be any basis of $\mathrm{Col}(A\,E[h h^\top]\,A^\top)$ containing vectors with the $k$ smallest $\ell_0$ norms. Since all
the columns of $A$ have at most $d_{\max}$ non-zero entries, we have $\max_{i \in [k]} \|u_i\|_0 \le d_{\max}$, by the choice of
the vectors $u_i$. Next we show that, due to the graph expansion property (Condition 3) and the parameter
genericity property (Condition 4), the vectors $u_i$ are (scaled) columns of $A$. Observe that any vector $u_i$
can be represented by a linear combination of columns of $A$, say $u_i = Av$. If $\|v\|_0 \ge 2$, then

$$\|u_i\|_0 = \|Av\|_0 > |\mathcal{N}_A(\mathrm{supp}(v))| - |\mathrm{supp}(v)| \ge d_{\max},$$

where the first inequality follows from the parameter genericity property and the second one follows from
the expansion property. This leads to a contradiction. Therefore, $\|v\|_0 = 1$, and $u_i$ is a scaled version
of a column of $A$. Since the $u_i$ are linearly independent, different $u_i$ correspond to different columns of
$A$. Let $\hat{A} = [u_1 | \cdots | u_k]$. Then, there exists a permutation matrix $\Pi$ and a diagonal matrix $\Lambda$ such
that $\hat{A} = A \Pi \Lambda$. We recover the scaling matrix $\Lambda$ using the fact that $A$ is in canonical form.
Let $\hat{B}$ be a left inverse of $\hat{A}$. We have

$$\hat{B} A\,E[h h^\top]\,A^\top \hat{B}^\top = \Lambda^{-1} \Pi^{-1} E[h h^\top] \Pi^{-\top} \Lambda^{-\top}. \qquad (16)$$

Consequently,

$$\mathrm{diag}(\hat{B} A\,E[h h^\top]\,A^\top \hat{B}^\top) = \mathrm{diag}(\Lambda^{-1} \Pi^{-1} E[h h^\top] \Pi^{-\top} \Lambda^{-\top}) = \Lambda^{-2}, \qquad (17)$$

where the last step follows from $\mathrm{diag}(E[h h^\top]) = I_{k \times k}$, as $A$ is in canonical form. Finally,

$$\hat{A}\,(\mathrm{diag}(\hat{B} A\,E[h h^\top]\,A^\top \hat{B}^\top))^{1/2} = \hat{A} \Lambda^{-1} = A \Pi. \qquad (18)$$

Therefore, we have identified all columns of $A$.
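The rescaling in (16)-(18) can be checked numerically. The sketch below assumes a candidate matrix `A_hat` whose columns are a permuted and positively scaled copy of the columns of $A$, together with the low-rank part `L` $= A\,E[hh^\top]A^\top$ where $E[hh^\top]$ has unit diagonal (canonical form). The function name is hypothetical, and the Moore-Penrose pseudo-inverse is used as one convenient choice of left inverse.

```python
import numpy as np

def rescale_columns(A_hat, L):
    """Undo the unknown column scaling of A_hat using diag(B L B^T) = Lambda^{-2}."""
    B_hat = np.linalg.pinv(A_hat)                  # left inverse: B_hat @ A_hat = I
    scale = np.sqrt(np.diag(B_hat @ L @ B_hat.T))  # entries |Lambda_jj|^{-1}
    return A_hat * scale                           # multiplies column j by scale[j]
```

Because only $\Lambda^{-2}$ is observed, the sign of each column is not determined by this step; the sketch returns the columns of $A\Pi$ exactly only when the scalings are positive.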
4.2 Proof of Theorem 3.2

Recall that $\Sigma = A\,E[h h^\top]\,A^\top + E[\varepsilon \varepsilon^\top]$. Using Lemma 4.1, Subroutine($\Sigma$, $\mathcal{P}$) (in the second step
of Algorithm 1) returns the low-rank part $L = A\,E[h h^\top]\,A^\top$ and the diagonal part $E[\varepsilon \varepsilon^\top]$. The
following lemma shows that the vectors $s_i$, returned by the loop (steps (3)-(5)), are scaled multiples of
the columns of $A$.

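Lemma 4.2. Let $A \in \mathbb{R}^{n \times k}$ be a given matrix with rank $k$, and let $L^{1/2} \in \mathbb{R}^{n \times k}$ be such that
$L^{1/2} = AM$, for an invertible $M \in \mathbb{R}^{k \times k}$. (Equivalently, $\mathrm{Col}(A) = \mathrm{Col}(L)$.) Fix $i \in [n]$ and consider
the following optimization problem:

$$\min_w \|L^{1/2} w\|_1 \quad \text{subject to } (e_i^\top L^{1/2}) w = 1. \qquad (19)$$

Under the following conditions, $s_i = L^{1/2} w$ is a scaling of the $\pi_i(1)$-th column of $A$. (Recall that
$\pi_i(1)$ is the index of the entry with maximum absolute value in the $i$-th row of $A$.)

(i) $\|A_{(\mathcal{N}_i^2)^c, (\mathcal{N}_i)^c}\, v\|_1 > \|A_{\mathcal{N}_i^2, (\mathcal{N}_i)^c}\, v\|_1$ for all non-zero vectors $v \in \mathbb{R}^{|(\mathcal{N}_i)^c|}$.

(ii) $\|A_{(\mathcal{N}_j)^c, \mathcal{N}_i \setminus j}\, v\|_1 > \|A_{\mathcal{N}_j, \mathcal{N}_i \setminus j}\, v\|_1 + (1 - \gamma_i) \|A_{\mathcal{N}_j, j}\|_1 \|v\|_1$ for all $j \in \mathcal{N}_i$ and all non-zero vectors
$v \in \mathbb{R}^{|\mathcal{N}_i| - 1}$.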



Proof (Lemma 4.2). Consider the following equivalent formulation of Problem (19), obtained by the
change of variables $z = Mw$, $b^\top = (e_i^\top L^{1/2}) M^{-1}$:

$$\min_z \|Az\|_1 \quad \text{subject to } b^\top z = 1. \qquad (20)$$

Observe that $b^\top$ is the $i$-th row of $A$. Denote the solution to Problem (20) by $z^*$. We aim to prove
that $z^*$ is supported on $\{\pi_i(1)\}$. We prove the desired result in two steps:

Claim 4.3. Under Condition (i), we have $\mathrm{supp}(z^*) \subseteq \mathrm{supp}(b)$.

Claim 4.4. Under Conditions (i)-(ii), we have $\mathrm{supp}(z^*) = \{\pi_i(1)\}$.



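Proof (Claim 4.3). Notice that $b^\top = e_i^\top A$, and so $\mathrm{supp}(b) = \mathcal{N}_i$. Define $z_0 \in \mathbb{R}^k$ by $z_0(j) := z^*(j)$
for all $j \in \mathrm{supp}(b)$, and $z_0(j) := 0$ for all $j \notin \mathrm{supp}(b)$. Also, let $z_1 := z^* - z_0$. Then $z_0$ is also a
feasible solution to Problem (20), since $b^\top z_0 = b^\top z^*$.
If $z_1 \neq 0$, then

$$\begin{aligned}
\|Az^*\|_1 &= \|A_{\mathcal{N}_i^2,[k]}\, z^*\|_1 + \|A_{(\mathcal{N}_i^2)^c,[k]}\, z^*\|_1 \\
&= \|A_{\mathcal{N}_i^2,[k]} (z_0 + z_1)\|_1 + \|A_{(\mathcal{N}_i^2)^c,[k]}\, z_1\|_1 \\
&\ge \|A_{\mathcal{N}_i^2,[k]}\, z_0\|_1 - \|A_{\mathcal{N}_i^2,[k]}\, z_1\|_1 + \|A_{(\mathcal{N}_i^2)^c,[k]}\, z_1\|_1 \\
&= \|Az_0\|_1 - \|A_{\mathcal{N}_i^2,[k]}\, z_1\|_1 + \|A_{(\mathcal{N}_i^2)^c,[k]}\, z_1\|_1 \\
&> \|Az_0\|_1,
\end{aligned}$$

where the last inequality follows from Condition (i) and the fact that $\mathrm{supp}(z_1) \subseteq (\mathcal{N}_i)^c$. Therefore, $z_0$ is a
feasible solution with smaller objective value, which contradicts the optimality of $z^*$. Therefore we
conclude that $z_1 = 0$, and hence $\mathrm{supp}(z^*) \subseteq \mathrm{supp}(b)$. □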
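Proof (Claim 4.4). By Claim 4.3, $\mathrm{supp}(z^*) \subseteq \mathrm{supp}(b) = \mathcal{N}_i$. To lighten the notation, let $j = \pi_i(1)$,
and define $z_0 := (e_j^\top z^*) e_j$ and $z_1 := z^* - z_0$. Suppose for the sake of contradiction that $z_1 \neq 0$. Since
$b^\top z^* = 1$, we have $z_0 = ((1 - b^\top z_1)/b_j)\, e_j$. Therefore (using the triangle inequality twice),

$$\begin{aligned}
\|Az^*\|_1 &= \|A_{\mathcal{N}_j,[k]}\, z^*\|_1 + \|A_{(\mathcal{N}_j)^c,[k]}\, z^*\|_1 \\
&= \|A_{\mathcal{N}_j,[k]} (z_0 + z_1)\|_1 + \|A_{(\mathcal{N}_j)^c,[k]}\, z_1\|_1 \\
&\ge \|A_{\mathcal{N}_j,[k]}\, z_0\|_1 - \|A_{\mathcal{N}_j,[k]}\, z_1\|_1 + \|A_{(\mathcal{N}_j)^c,[k]}\, z_1\|_1 \\
&= \|A_{\mathcal{N}_j,[k]} ((1 - b^\top z_1)/b_j)\, e_j\|_1 - \|A_{\mathcal{N}_j,[k]}\, z_1\|_1 + \|A_{(\mathcal{N}_j)^c,[k]}\, z_1\|_1 \\
&\ge (1/|b_j|) \|A_{\mathcal{N}_j,[k]}\, e_j\|_1 - |b^\top z_1 / b_j|\, \|A_{\mathcal{N}_j,[k]}\, e_j\|_1 - \|A_{\mathcal{N}_j,[k]}\, z_1\|_1 + \|A_{(\mathcal{N}_j)^c,[k]}\, z_1\|_1.
\end{aligned}$$

Since $z_1(j) = 0$, we have $|b^\top z_1| \le |b|_{\pi_i(2)} \|z_1\|_1$ by Hölder's inequality, and therefore,

$$\frac{|b^\top z_1|}{|b_j|} \le \frac{|b|_{\pi_i(2)} \|z_1\|_1}{|b|_{\pi_i(1)}} \le (1 - \gamma_i) \|z_1\|_1.$$

Moreover, by Condition (ii) and the fact that $\mathrm{supp}(z_1) \subseteq \mathcal{N}_i \setminus j$,

$$\|A_{(\mathcal{N}_j)^c,[k]}\, z_1\|_1 > \|A_{\mathcal{N}_j,[k]}\, z_1\|_1 + (1 - \gamma_i) \|A_{\mathcal{N}_j, j}\|_1 \|z_1\|_1.$$

Putting the last three displayed inequalities together gives

$$\|Az^*\|_1 > (1/|b_j|) \|A_{\mathcal{N}_j,[k]}\, e_j\|_1 = \|A (e_j / b_j)\|_1.$$

Since $e_j / b_j$ is a feasible solution, the above strict inequality contradicts the optimality of $z^*$. Therefore
we conclude that $z_1 = 0$, and $z^* = z_0 = e_j / b_j$. □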

Notice that $s_i = L^{1/2} w = AMw = Az^*$, and since $\mathrm{supp}(z^*) = \{\pi_i(1)\}$, $s_i$ is a scaled multiple of
the $\pi_i(1)$-th column of $A$. This completes the proof of Lemma 4.2. □



Now, we are ready to prove the theorem. 

Given that Conditions (i)-(ii) hold for all $i \in [n]$, using Lemma 4.2, the set $\mathcal{S} = \{s_1, \dots, s_n\}$
consists of scaled multiples of the columns of $A$. Moreover, since $[k] \subseteq \{\pi_1(1), \dots, \pi_n(1)\}$, $\mathcal{S}$ contains a
scaled multiple of each column of $A$. In the loop (steps (6)-(10)), we choose a linearly independent set
$\{v_1, \dots, v_k\} \subseteq \mathcal{S}$. These are the (scaled multiples of the) columns of $A$. Hence, letting $\hat{A} = [v_1 | \cdots | v_k]$,
there exists a diagonal matrix $\Lambda \in \mathbb{R}^{k \times k}$ and a permutation matrix $\Pi \in \mathbb{R}^{k \times k}$ such that $\hat{A} = A \Pi \Lambda$.
Let $\hat{B}$ be a left inverse of $\hat{A}$. We have

$$\hat{B} L \hat{B}^\top = \hat{B} A\,E[h h^\top]\,A^\top \hat{B}^\top = \Lambda^{-1} \Pi^{-1} E[h h^\top] \Pi^{-\top} \Lambda^{-\top}. \qquad (21)$$

Consequently,

$$\mathrm{diag}(\hat{B} L \hat{B}^\top) = \mathrm{diag}(\Lambda^{-1} \Pi^{-1} E[h h^\top] \Pi^{-\top} \Lambda^{-\top}) = \Lambda^{-2}, \qquad (22)$$

where the last step follows from $\mathrm{diag}(E[h h^\top]) = I_{k \times k}$, as $A$ is in canonical form. Finally,

$$\hat{A}\,(\mathrm{diag}(\hat{B} L \hat{B}^\top))^{1/2} = \hat{A} \Lambda^{-1} = A \Pi. \qquad (23)$$

Therefore, Algorithm 1 returns all the columns of $A$.
4.3 Proof of Theorem 3.4

We identify the matrices $A_i$ (up to permutation of their columns) in a sequential manner. Let $h_{L_i}$
denote the vector formed by the hidden variables in level $L_i$, for $i \in [m-1]$. Also, let $\varepsilon_{L_i}$ be the
noise vector formed by the noise variables associated to the hidden nodes in layer $L_i$, for $i \in [m-1]$.
Write

$$\Sigma = A_{m-1}\, E[h_{L_{m-1}} h_{L_{m-1}}^\top]\, A_{m-1}^\top + E[\varepsilon_{L_{m-1}} \varepsilon_{L_{m-1}}^\top]. \qquad (24)$$

By applying Theorem 3.1, we can identify the columns of $A_{m-1}$. Equivalently, we recover
$\hat{A}_{m-1} = A_{m-1} \Pi_{m-1}$ for some permutation matrix $\Pi_{m-1}$. Let $\hat{B}_{m-1}$ be a left inverse of $\hat{A}_{m-1}$.
As demonstrated in the proof of Theorem 3.1, we can decompose $\Sigma$ into its low-rank and diagonal
parts. Therefore we have access to $A_{m-1}\, E[h_{L_{m-1}} h_{L_{m-1}}^\top]\, A_{m-1}^\top$. Now, notice that

$$\hat{B}_{m-1} A_{m-1}\, E[h_{L_{m-1}} h_{L_{m-1}}^\top]\, A_{m-1}^\top \hat{B}_{m-1}^\top = \Pi_{m-1}^\top\, E[h_{L_{m-1}} h_{L_{m-1}}^\top]\, \Pi_{m-1}. \qquad (25)$$

In words, we can recover the second order moment of the hidden variables in level $L_{m-1}$, up to
a permutation of the nodes within this layer. Using the same technique sequentially, we can recover
all the columns of $A_i$ for $i \in [m-1]$, and thus the entire DAG is identifiable up to permutation of
hidden nodes within each level.

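4.4 Proof of Theorem 3.7

Let $\eta := (\eta_1, \dots, \eta_k)$ and $\varepsilon := (\varepsilon_1, \dots, \varepsilon_n)$. Using the model description, we have

$$x = A(I - \Lambda)^{-1} \eta + \varepsilon. \qquad (26)$$

Define $M := A(I - \Lambda)^{-1} \in \mathbb{R}^{n \times k}$. Then

$$\begin{aligned}
\Sigma = E[x x^\top] &= E[(M\eta + \varepsilon)(M\eta + \varepsilon)^\top] \\
&= M\, E[\eta \eta^\top]\, M^\top + E[\varepsilon \varepsilon^\top] \\
&= M\, \mathrm{diag}(\sigma^2_{\eta_1}, \dots, \sigma^2_{\eta_k})\, M^\top + \mathrm{diag}(\sigma^2_{\varepsilon_1}, \dots, \sigma^2_{\varepsilon_n}).
\end{aligned} \qquad (27)$$

Given that $A$ satisfies the rank condition, it is immediate to see that $M\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})$ also
satisfies the rank condition. Therefore, applying Lemma 4.1, we can decompose $\Sigma$ into its low-rank
part ($L_\Sigma$) and its diagonal part ($D_\Sigma$), where

$$L_\Sigma = M\, \mathrm{diag}(\sigma^2_{\eta_1}, \dots, \sigma^2_{\eta_k})\, M^\top, \qquad (28)$$
$$D_\Sigma = \mathrm{diag}(\sigma^2_{\varepsilon_1}, \dots, \sigma^2_{\varepsilon_n}). \qquad (29)$$

Since $A$ has full column rank, $U^\top L_\Sigma U \in \mathbb{R}^{k \times k}$ also has full rank; hence, the whitening step (Part
2 in Algorithm 2) is possible. We have

$$I = W^\top L_\Sigma W = W^\top M\, \mathrm{diag}(\sigma^2_{\eta_1}, \dots, \sigma^2_{\eta_k})\, M^\top W.$$

Therefore, the matrix $N := W^\top M\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k}) \in \mathbb{R}^{k \times k}$ is an orthogonal matrix.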
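The whitening step can be illustrated with a small NumPy sketch. It takes the low-rank part $L_\Sigma$ and the number of hidden nodes $k$, builds a basis $U$ of $\mathrm{Col}(L_\Sigma)$ from the top-$k$ eigenvectors, and forms $W = UV$ with $V = (U^\top L_\Sigma U)^{-1/2}$. The function name `whiten` and the use of a symmetric eigendecomposition are choices made for this illustration; any $U$ and $V$ satisfying the conditions of Algorithm 2 would do.

```python
import numpy as np

def whiten(L_sigma, k):
    """Return W (n x k) with W^T L_sigma W = I, for a rank-k PSD matrix L_sigma."""
    # Eigenvalues are returned in ascending order, so the last k eigenvectors
    # span the column space of L_sigma.
    _, evecs = np.linalg.eigh(L_sigma)
    U = evecs[:, -k:]
    small = U.T @ L_sigma @ U                 # k x k, positive definite
    w, Q = np.linalg.eigh(small)
    V = Q @ np.diag(w ** -0.5) @ Q.T          # symmetric inverse square root
    return U @ V
```

With $W$ computed this way, $N = W^\top M\,\mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})$ satisfies $N N^\top = W^\top L_\Sigma W = I$, which is the orthogonality used in the rest of the proof.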



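Lemma 4.5. We have

$$\Psi(\zeta) = \mathrm{diag}(\mu_{\varepsilon_1} \zeta_1, \dots, \mu_{\varepsilon_n} \zeta_n) + M\, \mathrm{diag}(\mu_{\eta_1}, \dots, \mu_{\eta_k})\, \mathrm{diag}(M^\top \zeta)\, M^\top. \qquad (30)$$

Lemma 4.5 is proved in Appendix C.

Applying Lemma 4.1 again, we decompose $\Psi(W\theta)$ into its diagonal and low-rank parts:

$$L_\Psi = M\, \mathrm{diag}(\mu_{\eta_1}, \dots, \mu_{\eta_k})\, \mathrm{diag}(M^\top W \theta)\, M^\top, \qquad (31)$$
$$D_\Psi = \mathrm{diag}(\mu_{\varepsilon_1} (W\theta)_1, \dots, \mu_{\varepsilon_n} (W\theta)_n). \qquad (32)$$

Now, observe that

$$\begin{aligned}
W^\top L_\Psi W &= W^\top M\, \mathrm{diag}(\mu_{\eta_1}, \dots, \mu_{\eta_k})\, \mathrm{diag}(M^\top W \theta)\, M^\top W \\
&= N\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})^{-1}\, \mathrm{diag}(\mu_{\eta_1}, \dots, \mu_{\eta_k})\, \mathrm{diag}(M^\top W \theta)\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})^{-1} N^\top.
\end{aligned} \qquad (33)$$

Since $N$ is an orthogonal matrix, the above is an SVD of $W^\top L_\Psi W$, and $N_1, \dots, N_k$ are singular
vectors, where $N_i$ denotes the $i$-th column of $N$. Note that $N_i = \sigma_{\eta_i} W^\top M_i$ for $i \in [k]$.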

A key observation is that an SVD uniquely determines all singular vectors (up to sign) which 
have distinct singular values. Following a similar approach to (21, we sample uniformly at random 
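over the sphere in $\mathbb{R}^k$ to ensure that all the singular values of $W^\top L_\Psi W$ are distinct. Define

$$D := \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})^{-1}\, \mathrm{diag}(\mu_{\eta_1}, \dots, \mu_{\eta_k})\, \mathrm{diag}(M^\top W \theta)\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})^{-1}. \qquad (34)$$

Note that the diagonal of the matrix $D$ is the following vector:

$$\begin{aligned}
&\mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})^{-1}\, \mathrm{diag}(\mu_{\eta_1}, \dots, \mu_{\eta_k})\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})^{-1} M^\top W \theta \\
&\quad = \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})^{-1}\, \mathrm{diag}(\mu_{\eta_1}, \dots, \mu_{\eta_k})\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})^{-2} N^\top \theta.
\end{aligned}$$

Since $\theta$ is sampled uniformly over the sphere, and $N$ is a rotation matrix, the distribution of $N^\top \theta$
is also uniform over the sphere. Consequently, all the singular values of $W^\top L_\Psi W$ are non-zero and
distinct. Therefore, the set $\Omega$ (in step (8) of the algorithm) is given by

$$\Omega = \{\sigma_{\eta_i} W^\top M_i\}_{i=1}^k.$$

The columns of matrix $S$, defined in step (9) of the algorithm, are then

$$\begin{aligned}
\{(W^+)^\top \omega : \omega \in \Omega\} &= \{W (W^\top W)^{-1} \sigma_{\eta_i} W^\top M_i\}_{i=1}^k \\
&= \{W (W^\top W)^{-1} W^\top \sigma_{\eta_i} M_i\}_{i=1}^k = \{\sigma_{\eta_i} M_i\}_{i=1}^k,
\end{aligned}$$

where the last step holds since $W (W^\top W)^{-1} W^\top$ is a projection and $\mathrm{Range}(W) = \mathrm{Range}(U) =
\mathrm{Range}(L_\Sigma) = \mathrm{Range}(M)$. Hence, there exists a permutation $\Pi_1$ such that

$$S = M\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})\, \Pi_1 = A(I - \Lambda)^{-1}\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})\, \Pi_1.$$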



Note that $\mathrm{Col}(S) = \mathrm{Col}(A)$. As demonstrated in the proof of Theorem 3.1, we
can identify all the columns of $A$, as $A$ satisfies the graph expansion and the parameter genericity
property. Moreover, under the assumptions of Theorem 3.2, Algorithm 1($\Sigma$) returns all columns
of $A$. Therefore, we can recover $\hat{A} = A \Pi_2$, for a permutation matrix $\Pi_2 \in \mathbb{R}^{k \times k}$. Let $\hat{B}$ be a left
inverse of $\hat{A}$. Then

$$C := \hat{B} S = \hat{B} A (I - \Lambda)^{-1}\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})\, \Pi_1 = \Pi_2^{-1} (I - \Lambda)^{-1}\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})\, \Pi_1.$$

Consider a topological ordering of the induced DAG on the hidden nodes. In such an ordering, for
every directed edge $(j, i)$, we have $j < i$. Hence, $\Lambda$ would be a lower triangular matrix in a topological
ordering. We proceed by reordering the rows and the columns of $C$ to get a lower triangular matrix.
This may be done in many different ways, but we show that all possible permutations that make $C$
lower triangular correspond to different topological orderings of the same DAG. Therefore, we can
choose any such permuted version of $C$, call it $\tilde{C}$. Then there exists a topological ordering with
corresponding matrix $\Lambda$ such that $(I - \Lambda)^{-1}\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k}) = \tilde{C}$, and thus $\Lambda = I - \mathrm{diag}(\tilde{C})\, \tilde{C}^{-1}$.

Let $R_1$ denote the set of rows in $C$ with exactly one non-zero entry. In any lower triangular version
of $C$, the rows in $R_1$ should appear on top. Furthermore, their non-zero entries should appear in
the first $|R_1|$ columns. Note that rows in $R_1$ correspond to hidden nodes with no parent. Obviously,
any ordering of them with labels $1, \dots, |R_1|$ is faithful to topological orderings. Now, we can remove
these nodes from the DAG (equivalently, eliminate the $R_1$ columns and rows from $C$) and repeat the
same argument. Therefore, different permuted versions of $C$ which are lower triangular correspond
to different topological orderings of the DAG. This completes the proof.
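The peeling argument above translates directly into a procedure for the reordering step of Algorithm 2. The sketch below repeatedly collects the rows that have exactly one non-zero entry among the remaining columns, places them next, and removes them. It is an illustration under the assumption that $C$ is an exactly (or numerically, up to `tol`) row- and column-permuted lower triangular matrix; the function names and the tolerance are hypothetical.

```python
import numpy as np

def lower_triangularize(C, tol=1e-9):
    """Permute rows and columns of C so that the result is lower triangular."""
    k = C.shape[0]
    rows_left, cols_left = list(range(k)), list(range(k))
    row_order, col_order = [], []
    while rows_left:
        # Rows with a single non-zero among the remaining columns play the
        # role of the set R_1 in the proof (hidden nodes with no parent left).
        picked = {}
        for r in rows_left:
            nz = [c for c in cols_left if abs(C[r, c]) > tol]
            if len(nz) == 1 and nz[0] not in picked.values():
                picked[r] = nz[0]
        if not picked:
            raise ValueError("C cannot be permuted into lower triangular form")
        row_order += list(picked.keys())
        col_order += list(picked.values())
        rows_left = [r for r in rows_left if r not in picked]
        cols_left = [c for c in cols_left if c not in picked.values()]
    return C[np.ix_(row_order, col_order)]

def hidden_coefficients(C_tilde):
    """Lambda = I - diag(C_tilde) C_tilde^{-1}, as in the last step of Algorithm 2."""
    k = C_tilde.shape[0]
    return np.eye(k) - np.diag(np.diag(C_tilde)) @ np.linalg.inv(C_tilde)
```

When several rows qualify in the same round, their relative order is arbitrary, which mirrors the statement that different lower triangular versions of $C$ correspond to different topological orderings.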
A Proof of Remark 2.5

Let $\tilde{M} := M + Z$. We first establish some definitions.

Definition A.1. We call a vector fully dense if all of its entries are non-zero.

Definition A.2. We say a matrix has the Null Space Property (NSP) if its null space does not
contain any fully dense vector.

Claim A.3. Fix any $S \subseteq [k]$ with $|S| \ge 2$, and set $R := \mathcal{N}_M(S)$. Let $\tilde{C}$ be a $|S| \times |S|$ submatrix of
$\tilde{M}_{R,S}$. Then $\Pr(\tilde{C} \text{ has the NSP}) = 1$.
Now, we are ready to prove Remark 2.5.

Proof (Remark 2.5). It follows from Claim A.3 that, with probability one, the following event holds:
for every $S \subseteq [k]$ with $|S| \ge 2$, and every $|S| \times |S|$ submatrix $\tilde{C}$ of $\tilde{M}_{R,S}$, $\tilde{C}$ has the NSP. Henceforth
condition on this event.

Now fix $v \in \mathbb{R}^k$ with $\|v\|_0 \ge 2$. Let $S := \mathrm{supp}(v)$, $R := \mathcal{N}_M(S)$ and $B := \tilde{M}_{R,S}$. Furthermore,
let $u \in (\mathbb{R} \setminus \{0\})^{|S|}$ be the restriction of the vector $v$ to $S$; observe that $u$ is fully dense. It is clear that
$\|\tilde{M} v\|_0 = \|B u\|_0$, so we need to show that

$$\|B u\|_0 > |R| - |S|. \qquad (35)$$

Suppose for the sake of contradiction that $Bu$ has at most $|R| - |S|$ non-zero entries. Then there is
a subset of $|S|$ entries on which $Bu$ is zero. This corresponds to a $|S| \times |S|$ submatrix of $B$ which
contains $u$ in its null space, which means that this submatrix does not have the NSP, a contradiction.
Therefore we conclude that $Bu$ must have more than $|R| - |S|$ non-zero entries. □



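Proof (Claim A.3). Let $s = |S|$ and let $\tilde{C} = [\tilde{c}_1 | \tilde{c}_2 | \cdots | \tilde{c}_s]^\top$, where $\tilde{c}_i^\top$ is the $i$-th row of $\tilde{C}$. Also,
let $C := [c_1 | c_2 | \cdots | c_s]^\top$ and $W := [w_1 | w_2 | \cdots | w_s]^\top$ be the corresponding submatrices of $M$ and $Z$,
respectively. For each $i \in [s]$, denote by $\mathcal{N}_i$ the null space of the matrix $\tilde{C}_i = [\tilde{c}_1 | \tilde{c}_2 | \cdots | \tilde{c}_i]^\top$. Finally,
let $\mathcal{N}_0 = \mathbb{R}^s$. Then $\mathcal{N}_0 \supseteq \mathcal{N}_1 \supseteq \cdots \supseteq \mathcal{N}_s$. We need to show that, with probability one, $\mathcal{N}_s$ does not
contain any fully dense vector.

If one of the $\mathcal{N}_i$ does not contain any fully dense vector, then we are done. Suppose that $\mathcal{N}_i$ contains
some fully dense vector $v$. Since $C$ is a submatrix of $M_{R,S}$, every row $c_{i+1}^\top$ of $C$ contains at least one
non-zero entry. Therefore

$$v^\top \tilde{c}_{i+1} = \sum_{j \in [s] : c_{i+1}(j) \neq 0} v(j)\, (c_{i+1}(j) + w_{i+1}(j)),$$

where $\{w_{i+1}(j) : j \in [s] \text{ s.t. } c_{i+1}(j) \neq 0\}$ are independent random variables (from $Z$). Moreover,
they are independent of $\tilde{c}_1, \dots, \tilde{c}_i$ and thus of $v$. By assumption on the distribution of the $w_{i+1}(j)$,

$$\Pr\big[v \in \mathcal{N}_{i+1} \,\big|\, \tilde{c}_1, \tilde{c}_2, \dots, \tilde{c}_i\big] = \Pr\Big[\sum_{j \in [s] : c_{i+1}(j) \neq 0} v(j)\, (c_{i+1}(j) + w_{i+1}(j)) = 0 \,\Big|\, \tilde{c}_1, \tilde{c}_2, \dots, \tilde{c}_i\Big] = 0. \qquad (36)$$

Consequently,

$$\Pr\big[\dim(\mathcal{N}_{i+1}) < \dim(\mathcal{N}_i) \,\big|\, \tilde{c}_1, \tilde{c}_2, \dots, \tilde{c}_i\big] = 1 \qquad (37)$$

for all $i = 0, \dots, s-1$. As a result, with probability one, $\dim(\mathcal{N}_s) = 0$. □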
B Proof of Lemma 4.1

For each $I \in \mathcal{P}$, let $U_I, V_I \in \mathbb{R}^{|I| \times k}$ be any matrices such that $U_I^\top A_I$ and $V_I^\top B_I$ are invertible. Then
for any distinct $I, J, K \in \mathcal{P}$,

$$\begin{aligned}
A_I B_I^\top &= (A_I B_J^\top V_J)(B_J^\top V_J)^{-1} (U_K^\top A_K)^{-1} (U_K^\top A_K B_I^\top) \\
&= (A_I B_J^\top V_J)(U_K^\top A_K B_J^\top V_J)^{-1} (U_K^\top A_K B_I^\top).
\end{aligned} \qquad (38)$$

Notice that for any distinct $I, J \in \mathcal{P}$, $C_{I,J} = A_I B_J^\top$. Since $A_I$ and $B_J$ have rank $k$, so does $C_{I,J}$.
Let $U_I \in \mathbb{R}^{|I| \times k}$ and $V_J \in \mathbb{R}^{|J| \times k}$ be respectively the matrices of left and right singular vectors of
$C_{I,J}$ (corresponding to non-zero singular values). Since $U_I$ and $A_I$ have the same range, it follows
that $U_I^\top A_I$ is invertible. Similarly, $V_J^\top B_J$ is invertible. Using identity (38), we obtain

$$A_I B_I^\top = C_{I,J} V_J (U_K^\top C_{K,J} V_J)^{-1} U_K^\top C_{K,I}, \qquad (39)$$

for any distinct $I, J, K \in \mathcal{P}$. Therefore $D$ can be determined as $D_{I,I} = C_{I,I} - A_I B_I^\top$ for $I \in \mathcal{P}$, and
$L = AB^\top$ is subsequently determined as $L = C - D$.



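C Proof of Lemma 4.5

$$\begin{aligned}
\Psi(\zeta) = E[x x^\top \langle \zeta, x \rangle] &= E[(M\eta + \varepsilon)(\eta^\top M^\top + \varepsilon^\top) \langle \zeta, M\eta + \varepsilon \rangle] \\
&= E[(M \eta \eta^\top M^\top + \varepsilon \varepsilon^\top + M \eta \varepsilon^\top + \varepsilon \eta^\top M^\top)(\varepsilon^\top + \eta^\top M^\top) \zeta] \\
&= E[\varepsilon \varepsilon^\top \varepsilon^\top \zeta + M \eta \eta^\top M^\top (\eta^\top M^\top \zeta)] \\
&= E[\varepsilon \varepsilon^\top \langle \varepsilon, \zeta \rangle] + M\, E[\eta \eta^\top \langle \eta, M^\top \zeta \rangle]\, M^\top.
\end{aligned} \qquad (40)$$

The proof is completed by showing that for any deterministic vector $v \in \mathbb{R}^k$, and any random
vector $z = (z_1, \dots, z_k)$ with zero mean uncorrelated entries, we have

$$E[z z^\top \langle z, v \rangle] = \mathrm{diag}(v)\, \mathrm{diag}(\mu_{z_1}, \dots, \mu_{z_k}). \qquad (41)$$

We compute the diagonal and off-diagonal entries separately.

$$E[z_i z_i \langle v, z \rangle] = v_i E[z_i^3] + \sum_{k \neq i} v_k E[z_i^2] E[z_k] = v_i \mu_{z_i}. \qquad (42)$$

For $j \neq i$,

$$E[z_i z_j \langle v, z \rangle] = E\Big[z_i z_j \sum_k v_k z_k\Big] = v_i \sigma^2_{z_i} E[z_j] + v_j \sigma^2_{z_j} E[z_i] + \sum_{k \neq i, j} v_k E[z_i] E[z_j] E[z_k] = 0. \qquad (43)$$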
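Identity (41) can be verified exactly on a small discrete example by enumerating all outcomes. The sketch below assumes, for illustration only, that the coordinates of $z$ are independent and share one finite zero-mean distribution given by `values` and `probs`; the function name is hypothetical.

```python
import itertools
import numpy as np

def third_moment_projection(values, probs, v):
    """Exact E[z z^T <z, v>] for i.i.d. coordinates taking `values` with `probs`."""
    k = len(v)
    v = np.asarray(v, dtype=float)
    out = np.zeros((k, k))
    # Enumerate every joint outcome of the k independent coordinates.
    for combo in itertools.product(range(len(values)), repeat=k):
        z = np.array([values[c] for c in combo], dtype=float)
        p = float(np.prod([probs[c] for c in combo]))
        out += p * np.outer(z, z) * (z @ v)
    return out
```

For example, a coordinate taking the value $-1$ with probability $2/3$ and $2$ with probability $1/3$ has zero mean and third moment $2$, so the result should equal $2\,\mathrm{diag}(v)$.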



D Proof of Remark 3.8

Write

$$\Sigma = A\,E[h h^\top]\,A^\top + E[\varepsilon \varepsilon^\top]. \qquad (44)$$

By Theorem 3.1 we can identify the columns of $A$, i.e., we can recover $\hat{A} = A \Pi_1$ for some
permutation matrix $\Pi_1$. Also, as demonstrated in the proof of Theorem 3.1, we can decompose $\Sigma$
into its low-rank part $A\,E[h h^\top]\,A^\top$ and diagonal part $E[\varepsilon \varepsilon^\top]$. Let $\hat{B} \in \mathbb{R}^{k \times n}$ be a left inverse of $\hat{A}$.
Then,

$$\hat{B} A\,E[h h^\top]\,A^\top \hat{B}^\top = \Pi_1^{-1} E[h h^\top] \Pi_1^{-\top}. \qquad (45)$$

Therefore, we have the second order moment of the hidden nodes (in some ordering of the nodes).
Now consider the $k$ hidden nodes corresponding to the rows (and columns) of $\Pi_1^{-1} E[h h^\top] \Pi_1^{-\top}$. Label
these nodes with $1, \dots, k$. Using the oracle we can find a permutation $\pi_2$ which puts the hidden
nodes in a topological ordering. Let $\Pi_2$ be the permutation matrix corresponding to $\pi_2$. Then
$\tilde{\Sigma} := \Pi_2 \Pi_1^{-1} E[h h^\top] \Pi_1^{-\top} \Pi_2^\top$ is the second order moment of the hidden nodes in some topological
ordering. By definition of a topological ordering, it is immediate to see that the coefficient matrix $\Lambda$
is lower triangular in a topological ordering of the hidden nodes. Therefore, we can write

$$\tilde{\Sigma} = (I - \Lambda)^{-1} E[\eta \eta^\top] (I - \Lambda)^{-\top}, \qquad (46)$$

where $\eta$ is the vector formed by the noise variables $\eta_i$ (in the corresponding topological ordering)
and $\Lambda \in \mathbb{R}^{k \times k}$ is a lower triangular matrix with all diagonal entries equal to zero. Therefore,

$$\tilde{\Sigma}^{1/2} = (I - \Lambda)^{-1}\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})\, Q, \qquad (47)$$

for some rotation $Q \in \mathbb{R}^{k \times k}$. Notice that $L := (I - \Lambda)^{-1}\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})$ is a lower triangular
matrix with diagonal entries $\sigma_{\eta_1}, \dots, \sigma_{\eta_k}$, which are all positive. Hence, using the LQ decomposition
we can recover $L$. (Recall that the LQ factorization is unique if we require that the diagonal
entries of the lower triangular part are positive.)

Finally, $\mathrm{diag}(L) = \mathrm{diag}((I - \Lambda)^{-1})\, \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k}) = \mathrm{diag}(\sigma_{\eta_1}, \dots, \sigma_{\eta_k})$. Therefore, $\Lambda = I -
\mathrm{diag}(L) L^{-1}$. The result follows.
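The last two steps of this argument can be sketched numerically. Instead of an LQ factorization of $\tilde{\Sigma}^{1/2}$, the sketch below uses a Cholesky factorization of $\tilde{\Sigma}$, which yields the same lower triangular factor with positive diagonal, since $\tilde{\Sigma} = L L^\top$ by (46). This substitution, the function name, and the input layout (hidden second moment already in a topological ordering) are assumptions of the illustration.

```python
import numpy as np

def recover_lambda(Sigma_h):
    """Recover Lambda from the hidden second moment in a topological ordering.

    Sigma_h = (I - Lambda)^{-1} diag(sigma^2) (I - Lambda)^{-T}, with Lambda
    strictly lower triangular and all sigma > 0.
    """
    L = np.linalg.cholesky(Sigma_h)     # lower triangular with positive diagonal
    k = L.shape[0]
    # diag(L) = diag(sigma), so diag(L) L^{-1} = I - Lambda.
    return np.eye(k) - np.diag(np.diag(L)) @ np.linalg.inv(L)
```

The noise standard deviations are available as a by-product, as the diagonal of the Cholesky factor.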



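E Proof of Lemma 3.11

Let $A = USV^\top$ be a thin singular value decomposition of $A$, where $U \in \mathbb{R}^{n \times k}$ has orthonormal
columns, $S = \mathrm{diag}(\sigma_1(A), \dots, \sigma_k(A))$, and $V \in \mathbb{R}^{k \times k}$ is an orthogonal matrix. Fix a partition index
$v \in [\ell]$. Let $z_1, z_2, \dots, z_n \in \{0, 1\}$ be independent indicator random variables such that $z_i = 1$ iff row
$i$ of $A$ is included in $A_v$. Note that

$$A_v^\top A_v = A^\top \mathrm{diag}(z_1, z_2, \dots, z_n) A = \sum_{i=1}^n z_i A^\top e_i e_i^\top A = V S \Big(\sum_{i=1}^n z_i U^\top e_i e_i^\top U\Big) S V^\top. \qquad (48)$$

Therefore

$$\sigma_k(A_v)^2 = \lambda_{\min}(A_v^\top A_v) \ge \lambda_{\min}(S)^2 \cdot \lambda_{\min}\Big(\sum_{i=1}^n z_i U^\top e_i e_i^\top U\Big) = \sigma_k(A)^2 \cdot \lambda_{\min}\Big(\sum_{i=1}^n X_i\Big), \qquad (49)$$

where $X_i := z_i U^\top e_i e_i^\top U \in \mathbb{R}^{k \times k}$. Notice that $0 \preceq X_i$ and

$$\lambda_{\max}(X_i) \le \|U^\top e_i\|^2 \le \frac{k}{n} c_A. \qquad (50)$$

Moreover,

$$\sum_{i=1}^n E[X_i] = \sum_{i=1}^n E[z_i]\, U^\top e_i e_i^\top U = \frac{1}{\ell} \sum_{i=1}^n U^\top e_i e_i^\top U = \frac{1}{\ell} U^\top U = \frac{1}{\ell} I. \qquad (51)$$

By Lemma E.1,

$$\Pr\Big[\lambda_{\min}\Big(\sum_{i=1}^n X_i\Big) \le \frac{1}{4\ell}\Big] \le k \cdot e^{-(3/4)^2 / (2 \ell c_A k / n)} \le \delta / \ell, \qquad (52)$$

where the last inequality follows from the assumption on $c_A$. Therefore, by Eq. (49), $\sigma_k(A_v) \ge
\sigma_k(A)/(2\sqrt{\ell})$ with probability at least $1 - \delta/\ell$. A union bound over all $v \in [\ell]$ completes the proof.

Lemma E.1 (Matrix Chernoff bound [50]). Consider a finite sequence $\{X_i\}$ of independent and
symmetric $k \times k$ random matrices such that $0 \preceq X_i$ and $\lambda_{\max}(X_i) \le r$ almost surely. Define
$\mu_{\min} := \lambda_{\min}(\sum_i E X_i)$. For any $\epsilon \in [0, 1]$, we have

$$\Pr\Big[\lambda_{\min}\Big(\sum_i X_i\Big) \le (1 - \epsilon)\, \mu_{\min}\Big] \le k \cdot e^{-\epsilon^2 \mu_{\min} / (2r)}. \qquad (53)$$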